I am trying to remove some text from a string. What I want to remove could be any of the examples listed below: basically any combination of uppercase and lowercase letters, ending in any combination of digits or letters, with or without a space in between.
(Disk 1)
(Disk 5)
(Disc2)
(Disk 10)
(Part A)
(Pt B)
(disk a)
(CD 7)
(cD X)
I already have a method to find the beginning, the "(type" part:
multi_disk_search = ['(disk', '(disc', '(part', '(pt', '(prt']
if any(mds in fileName.lower() for mds in multi_disk_search):  # https://stackoverflow.com/a/3389611
    for mds in multi_disk_search:
        if mds in fileName.lower():
            print(mds)
            break
That returns (disc for example.
I cannot just split by the parentheses because there could be other tags in other parentheses. Also, there is no specific order to the tags. The one I am searching for is typically last; however, many times it is not.
I think the solution will require regex, but I'm really lost when it comes to that.
I tried this, but it returns something that doesn't make any sense to me.
regex = re.compile(r"\s*\%s\s*" % (mds), flags=re.I) #https://stackoverflow.com/a/20782251/11214013
regex.split(fileName)
newName = regex
print(newName)
Which returns re.compile('\\s*\\(disc\\s*', re.IGNORECASE)
What are some ways to solve this?
Perhaps something like this:
rx = re.compile(r'''
    \(
    (?: dis[ck] | p(?:a?r)?t )
    [ ]?
    (?: [a-z]+ | [0-9]+ )
    \)''', re.I | re.X)
This pattern uses only basic regex syntax, except perhaps for the X flag, i.e. verbose mode (with this flag, any whitespace character in the pattern is ignored unless it is escaped or inside a character class). Feel free to read the Python manual about the re module. Adding support for CD is left as an exercise.
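For example, applied to a hypothetical file name (you may want to strip the leftover whitespace afterwards):

print(rx.sub('', 'My Movie (2001) (Disk 1).mkv'))  # -> 'My Movie (2001) .mkv'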
>>> import re
>>> def remove_parens(s, multi_disk_search):
...     mds = '|'.join([re.escape(x) for x in multi_disk_search])
...     return re.sub(rf'\s*\((?:{mds})[0-9A-Za-z ]*\)', '', s, 0, re.I)
...
>>> multi_disk_search = ['disk', 'cd', 'disc', 'part', 'pt']
>>> remove_parens('this is a (disc a) string with (123xyz) parens removed', multi_disk_search)
'this is a string with (123xyz) parens removed'
In this post I am calling strings of the form (x1, ..., xn) a sequence and strings of the form {y1, ..., yn} a set, where each xi and yi can be either a number [0-9]+, a word [a-zA-Z]+, a number/word [a-zA-Z0-9]+, a sequence, or a set.
I want to know if it is at all possible (and if so, help figuring out how) to use Regex to deal with the following:
I want to transform sequences of the form (x1, ..., xn, xn+1) into ((x1, ..., xn), xn+1).
Examples:
(1,2,3,4,5) will change to ((1,2,3,4),5)
((1,2,3,4),5) will change to (((1,2,3),4),5) since the only string of the form (x1, ..., xn, xn+1) is the (1,2,3,4) on the inside.
(((1,2,3),4),5) will change to ((((1,2),3),4),5) since the only string of the form (x1, ..., xn, xn+1) is the (1,2,3) on the inside.
(1,(2,3),4) will change to ((1,(2,3)),4)
({1},{2},{3}) will change to (({1},{2}),{3})
As per request, more examples:
((1,2),(3,4),5) will change to (((1,2),(3,4)),5)
((1,2),3,(4,5)) will change to (((1,2),3),(4,5))
(1,(2,(3,4),5)) will change to (1,((2,(3,4)),5)) since the only sequence of the form (x1, ..., xn, xn+1) is the (2,(3,4),5) on the inside.
Here is what I have so far:
re.sub(r'([(][{}a-zA-Z0-9,()]+,[{}a-zA-Z0-9,]+),', r'(\g<1>),', string)
You can see the strings it works for here.
It is not working for (1,(2,3),4), and it is working on things like ({{x},{x,y}},z) when it shouldn't be. Any help is greatly appreciated; I feel like it is possible to get this to work in Regex but there seem to be a lot of special cases that require the Regex to be very precise.
For any recursion in Python, you'll have to use the PyPI regex module (as mentioned by anubhava in his answer). You can use the following pattern:
See regex in use here
(\((({(?:[^{}]+|(?-1))+})|\((?:[^(),]+,[^(),]+)\)|[^(){},]+)(?:,(?2))+),
Replace with (\1),
How it works:
(\((({(?:[^{}]+|(?-1))+})|\((?:[^(),]+,[^(),]+)\)|[^(){},]+)(?:,(?2))+), capture the following into capture group 1, then match ,
  \( match ( literally
  (({(?:[^{}]+|(?-1))+})|\((?:[^(),]+,[^(),]+)\)|[^(){},]+) match one of the following options (and capture it into capture group 2)
    ({(?:[^{}]+|(?-1))+}) capture the following into capture group 3
      {(?:[^{}]+|(?-1))+} matches {, then one of the following options one or more times, then }
        [^{}]+ matches any character except { or } one or more times
        (?-1) recurses the previous capture group (capture group 3)
    \((?:[^(),]+,[^(),]+)\) matches (, then the following, then )
      [^(),]+ matches any character except (, ), or , one or more times
      , matches the comma , character literally
      [^(),]+ matches any character except (, ), or , one or more times
    [^(){},]+ matches any character except (, ), {, }, or , one or more times
  (?:,(?2))+ matches the following one or more times
    ,(?2) matches ,, then recurses capture group 2
In simpler terms, capture group 2 defines what a term is. It:
Matches any set {y1, ..., yn}, and, recursively, nested sets like {{y1, ..., yn}, ..., ym}
Matches any complete sequence of exactly two elements: (x1, x2)
Matches any string object (numbers, words, etc.): 1, 2, ... x, y, ...
Then capture group 1 uses the well-defined terms from capture group 2 to match as many terms as possible: the match must contain at least two comma-separated terms, plus as many additional ,term repetitions as possible. The replacement takes this capture group, encases it in () and appends ,. So in the case of (x,x,x,x), we get ((x,x,x),x).
Edit
By making the non-capture group possessive (?:[^{}]+|(?-1))++ (prevents backtracking) and changing the order of the options (most prevalent first), we can improve the efficiency of the pattern (764 -> 662 steps):
See regex in use here
(\(([^(){},]+|\((?:[^(),]+,[^(),]+)\)|({(?:[^{}]+|(?-1))++}))(?:,(?2))+),
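For example, with the PyPI regex module (the standard re module cannot run this pattern, since it supports neither the (?2) recursion nor the possessive quantifier), one replacement pass might look like this:

import regex  # third-party module: pip install regex

pattern = regex.compile(
    r'(\(([^(){},]+|\((?:[^(),]+,[^(),]+)\)|({(?:[^{}]+|(?-1))++}))(?:,(?2))+),')

print(pattern.sub(r'(\1),', '(1,2,3,4,5)'))  # -> ((1,2,3,4),5)
print(pattern.sub(r'(\1),', '(1,(2,3),4)'))  # -> ((1,(2,3)),4)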
If you can consider using the PyPI regex module for Python, which supports PCRE features, then this is possible with recursive matching, using this regex:
/
(                                       # start capture group #1
  \(                                    # match left (
  (?<el>                                # start named group el
      ( { (?: [^{}]*+ | (?-1) )*  } )   # Match {...} text OR
    | ( \( (?: [^()]*+ | (?-1) )*+ \) ) # Match (...) text OR
    | \w+                               # Match 1+ word characters
  )                                     # End named group el
  (?: , (?&el) )+                       # Match comma followed by recursion of 'el',
                                        # 1+ times
) ,                                     # End capture group #1, then match ,
/x                                      # Enable extended mode in regex
RegEx Demo
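To run this from Python, the pattern can be translated for the regex module; a rough sketch (Python has no /.../x delimiters, so the x flag becomes regex.VERBOSE, and the named group is written (?P<el>...)):

import regex  # third-party module: pip install regex

rx = regex.compile(r'''
    (                                       # start capture group #1
      \(                                    # match left (
      (?P<el>                               # start named group el
          ( { (?: [^{}]*+ | (?-1) )*  } )   # Match {...} text OR
        | ( \( (?: [^()]*+ | (?-1) )*+ \) ) # Match (...) text OR
        | \w+                               # Match 1+ word characters
      )                                     # End named group el
      (?: , (?&el) )+                       # Match comma followed by recursion of 'el'
    ) ,                                     # End capture group #1, then match ,
''', regex.VERBOSE)

print(rx.sub(r'(\1),', '({1},{2},{3})'))  # -> (({1},{2}),{3})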
Here's a simple parser built with ply
It is certainly less compact than the regex solutions, but it has a couple of considerable advantages:
It's a lot easier to write and understand. (In fact, my first attempt worked perfectly except for a typo in one of the names.) Moreover, it is reasonably clear by examination exactly what syntax is being parsed. (This assumes some minimal understanding of generative grammars, of course, but the concepts are not particularly difficult and there are many available learning resources.)
If you want to add more features in the future, it's straightforward to modify. And if, instead of just reformatting the text, you want to actually make some use of the decomposed structure, that is easily available without much effort.
As with most generated parsers, it has two components: a lexer (or scanner), which decomposes the input into tokens and discards unimportant text such as whitespace, and a parser, which analyses the stream of tokens in order to figure out its structure. Normally a parser would construct some kind of data structure representing the input, usually a tree. Here I've simplified the process by just recombining the parsed input into a transformed output. (In retrospect, I can't help thinking that, as usual, it would have been clearer to produce an entire parse tree and then create the output by doing a walk over the tree. Perhaps as penance I'll redo it later.)
Here's the scanner. The only meaningful tokens are the punctuation symbols and what I've called WORDs, which are sequences of whatever Python considers word characters (usually alphabetic and numeric characters plus underscores), without distinguishing between purely alphabetic, purely numeric, and mixed tokens as in your question.
import ply.lex as lex

tokens = [ "WORD" ]
t_WORD = r"\w+"

# Punctuation
literals = "{}(),"

# Ignore whitespace
t_ignore = " \r\n\t"

# Anything else is an error
def t_error(t):
    print("Illegal character %s" % repr(t.value[0]))
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()
Now the parser. The grammar for sequences is a little redundant because it has to special-case a sequence of one item: since the grammar explicitly inserts parentheses around A,B as it parses, it would be incorrect to also add them around the entire sequence. But if the entire sequence is just one item, the original parentheses have to be reinserted. For sets, things are much clearer; the elements are not modified at all, and the braces must always be added back.
Here's the entire grammar:
# scalar   : WORD | set | sequence
# sequence : '(' scalar ')'
#          | '(' seqlist ')'
# seqlist  : scalar ',' scalar
#          | seqlist ',' scalar
# set      : '{' setlist '}'
# setlist  : scalar
#          | setlist ',' scalar
And here's the implementation, with the grammar repeated, Ply-style, as docstrings:
import ply.yacc as yacc

start = 'scalar'

def p_unit(p):
    """scalar : WORD
              | set
              | sequence
       setlist : scalar
    """
    p[0] = p[1]

def p_sequence_1(p):
    """sequence : '(' scalar ')'
    """
    p[0] = '(%s)' % p[2]

def p_sequence(p):
    """sequence : '(' seqlist ')'
    """
    p[0] = p[2]

def p_seqlist(p):
    """seqlist : scalar ',' scalar
               | seqlist ',' scalar
    """
    p[0] = "(%s,%s)" % (p[1], p[3])

def p_set(p):
    """set : '{' setlist '}'
    """
    p[0] = '{%s}' % p[2]

def p_setlist(p):
    """setlist : setlist ',' scalar
    """
    p[0] = "%s,%s" % (p[1], p[3])

def p_error(p):
    if p:
        print("Syntax error at token", p.type)
    else:
        print("Syntax error at EOF")

parser = yacc.yacc()
Now, a (very) simple driver:
import readline

while True:
    try:
        s = input('> ')
    except EOFError:
        break
    if s:
        print(parser.parse(s, lexer=lexer))
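Assuming Ply is installed, a session with this driver might look like the following. Note that because the seqlist rule inserts parentheses at every reduction, the parser fully left-nests a sequence in a single pass, rather than performing one transformation step per invocation like the regex approaches:

> (1,2,3,4,5)
((((1,2),3),4),5)
> (1,(2,3),4)
((1,(2,3)),4)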
I'm sure this will be easy pickings for more experienced programmers than I, but this problem is bedeviling me and I've made a couple of failed attempts, so I wanted to see what other people might come up with.
I have about a hundred strings that look something like this:
(argument1 OR argument2) | inputlookup my_lookup.csv | `macro1(tag,bunit)` | `macro2(category)` | `macro_3(tag,\"expected\",category)` | `macro4(tag,\"timesync\")`
The goal is to find the arguments to the macro function and replace them with the count of the arguments, so that the final output looks like this:
(argument1 OR argument2) | inputlookup my_lookup.csv | `macro1(2)` | `macro2(1)` | `macro_3(3)` | `macro4(2)`
Python has ways of obtaining the count I need (I was simply counting up the number of commas in a string and adding 1), and Python has plenty of regex-type solutions for inline string replacement, but for the life of me I can't figure out how to combine them.
It seems something like re.sub won't let me identify a substring, count the number of commas in the substring, and then replace the substring with that value (unless I am missing something in the docs).
Can anybody think of a way to do this? Have I missed something obvious?
Solution:
import re

def count_commas(input_str):
    c = 0
    for s in input_str:
        if s == ',':
            c += 1
    return c

pattern = r'\([A-Za-z0-9,""]+\)'
original_str = '(argument1 OR argument2) | inputlookup my_lookup.csv | `macro1(tag,bunit)` | `macro2(category)` | `macro_3(tag,\"expected\",category)` | `macro4(tag,\"timesync\")`'

matches = re.findall(pattern, original_str)
for match in matches:
    comma_count = count_commas(match) + 1
    match = match.replace('(', '\\(').replace(')', '\\)')
    original_str = re.sub(match, '(' + str(comma_count) + ')', original_str)

print(original_str)
Explanation:
pattern : "\([A-Za-z0-9,""]+\)" - the backslashes escape the special characters '(' and ')' in regex; inside the square brackets I am looking for alphanumerics, commas and quotation marks, and the '+' means one or more repetitions of those characters.
matches : the list of all the matches found, e.g. (tag,bunit)
Then, I am looping over all the matches to find the number of commas in the match, followed by replacing the '(' with '\(' and ')' with '\)' so as to escape in regex.
Finally, in the last line of the loop, I am using re.sub to replace the matched string with the comma count in the original string.
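For what it's worth, re.sub does let you do this in a single pass: the replacement argument can be a function, which receives each match object and returns the replacement text. A minimal sketch, assuming the macro argument lists are the only parenthesized groups immediately followed by a closing backtick:

import re

original_str = '(argument1 OR argument2) | inputlookup my_lookup.csv | `macro1(tag,bunit)` | `macro2(category)` | `macro_3(tag,\"expected\",category)` | `macro4(tag,\"timesync\")`'

# The lambda is called once per match; group 1 is the raw argument list.
result = re.sub(r'\(([^()]+)\)(?=`)',
                lambda m: '(%d)' % (m.group(1).count(',') + 1),
                original_str)
print(result)
# (argument1 OR argument2) | inputlookup my_lookup.csv | `macro1(2)` | `macro2(1)` | `macro_3(3)` | `macro4(2)`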
The Setup:
Let's say I have the following regex defined in my script. I want to keep the comments there for future me because I'm quite forgetful.
RE_TEST = re.compile(r"""[0-9] # 1 Number
[A-Z] # 1 Uppercase Letter
[a-y] # 1 lowercase, but not z
z # gotta have z...
""",
re.VERBOSE)
print(magic_function(RE_TEST)) # returns: "[0-9][A-Z][a-y]z"
The Question:
Does Python (3.4+) have a way to convert that to the simple string "[0-9][A-Z][a-y]z"?
Possible Solutions:
This question ("strip a verbose python regex") seems to be pretty close to what I'm asking for and it was answered. But that was a few years ago, so I'm wondering if a new (preferably built-in) solution has been found.
In addition to the above, there are work-arounds such as using implicit string concatenation and then using the .pattern attribute:
RE_TEST = re.compile(r"[0-9]" # 1 Number
r"[A-Z]" # 1 Uppercase Letter
r"[a-y]" # 1 lowercase, but not z
r"z", # gotta have z...
re.VERBOSE)
print(RE_TEST.pattern) # returns: "[0-9][A-Z][a-y]z"
or just commenting the pattern separately and not compiling it:
# matches pattern "nXxz"
RE_TEST = "[0-9][A-Z][a-y]z"
print(RE_TEST)
But I'd really like to keep the compiled regex the way it is (1st example). Perhaps I'm pulling the regex string from some file, and that file is already using the verbose form.
Background
I'm asking because I want to suggest an edit to the unittest module.
Right now, if you run assertRegex(string, pattern) using a compiled pattern with comments and that assertion fails, then the printed output is somewhat ugly (the below is a dummy regex):
Traceback (most recent call last):
  File "verify_yaml.py", line 113, in test_verify_mask_names
    self.assertRegex(mask, RE_MASK)
AssertionError: Regex didn't match: '(X[1-9]X[0-9]{2}) # comment\n |(XXX[0-9]{2}) # comment\n |(XXXX[0-9E]) # comment\n |(XXXX[O1-9]) # comment\n |(XXX[0-9][0-9]) # comment\n |(XXXX[1-9]) # comment\n ' not found in 'string'
I'm going to propose that the assertRegex and assertNotRegex methods clean the regex before printing it, by either removing the comments and extra whitespace or by printing it differently.
The following tested script includes a function that does a pretty good job converting an xmode regex string to non-xmode:
pcre_detidy(retext)
# Function pcre_detidy to convert xmode regex string to non-xmode.
# Rev: 20160225_1800
import re

def detidy_cb(m):
    if m.group(2): return m.group(2)
    if m.group(3): return m.group(3)
    return ""

def pcre_detidy(retext):
    decomment = re.compile(r"""(?#!py/mx decomment Rev:20160225_1800)
        # Discard whitespace, comments and the escapes of escaped spaces and hashes.
          ( (?: \s+                  # Either g1of3 $1: Stuff to discard (3 types). Either ws,
            | \#.*                   # or comments,
            | \\(?=[\r\n]|$)         # or lone escape at EOL/EOS.
            )+                       # End one or more from 3 discardables.
          )                          # End $1: Stuff to discard.
        | ( [^\[(\s#\\]+             # Or g2of3 $2: Stuff to keep. Either non-[(\s# \\.
          | \\[^# Q\r\n]             # Or escaped-anything-but: hash, space, Q or EOL.
          | \(                       # Or an open parentheses, optionally
            (?:\?\#[^)]*(?:\)|$))?   # starting a (?# Comment group).
          | \[\^?\]? [^\[\]\\]*      # Or Character class. Allow unescaped ] if first char.
            (?:\\[^Q][^\[\]\\]*)*    # {normal*} Zero or more non-[], non-escaped-Q.
            (?:                      # Begin unrolling loop {((special1|2) normal*)*}.
              (?: \[(?::\^?\w+:\])?  # Either special1: "[", optional [:POSIX:] char class.
              | \\Q [^\\]*           # Or special2: \Q..\E literal text. Begin with \Q.
                (?:\\(?!E)[^\\]*)*   # \Q..\E contents - everything up to \E.
                (?:\\E|$)            # \Q..\E literal text ends with \E or EOL.
              ) [^\[\]\\]*           # End special: One of 2 alternatives {(special1|2)}.
              (?:\\[^Q][^\[\]\\]*)*  # More {normal*} Zero or more non-[], non-escaped-Q.
            )* (?:\]|\\?$)           # End character class with ']' or EOL (or \\EOL).
          | \\Q [^\\]*               # Or \Q..\E literal text start delimiter.
            (?:\\(?!E)[^\\]*)*       # \Q..\E contents - everything up to \E.
            (?:\\E|$)                # \Q..\E literal text ends with \E or EOL.
          )                          # End $2: Stuff to keep.
        | \\([# ])                   # Or g3of3 $3: Escaped-[hash|space], discard the escape.
        """, re.VERBOSE | re.MULTILINE)
    return re.sub(decomment, detidy_cb, retext)
test_text = r"""
[0-9] # 1 Number
[A-Z] # 1 Uppercase Letter
[a-y] # 1 lowercase, but not z
z # gotta have z...
"""
print(pcre_detidy(test_text))
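Running this should print the test pattern collapsed to its non-verbose form:
[0-9][A-Z][a-y]z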
This function detidies regexes written in pcre-8/pcre2-10 xmode syntax.
It preserves whitespace inside [character classes], (?#comment groups) and \Q...\E literal text spans.
RegexTidy
The decomment regex above is a variant of one I am using in my upcoming, yet-to-be-released RegexTidy application, which will not only detidy a regex as shown above (which is pretty easy to do), but will also go the other way and tidy a regex - i.e. convert it from non-xmode to xmode syntax, adding whitespace indentation to nested groups as well as adding comments (which is harder).
p.s. Before giving this answer a downvote on general principle because it uses a regex longer than a couple lines, please add a comment describing one example which is not handled correctly. Cheers!
Looking through the way sre_parse handles this, there really isn't any point where your verbose regex gets "converted" into a regular one and then parsed. Rather, your verbose regex is being fed directly to the parser, where the presence of the VERBOSE flag makes it ignore unescaped whitespace outside character classes, and from unescaped # to end-of-line if it is not inside a character class or a capture group (which is missing from the docs).
The outcome of parsing your verbose regex there is not "[0-9][A-Z][a-y]z". Rather it is:
[(IN, [(RANGE, (48, 57))]), (IN, [(RANGE, (65, 90))]), (IN, [(RANGE, (97, 121))]), (LITERAL, 122)]
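If you want to see that parse result for yourself, you can call the parser directly. A minimal sketch; note that sre_parse is a CPython implementation detail (renamed to re._parser in 3.11), so it shouldn't be relied on in production code:

import re
import sre_parse  # internal CPython module; re._parser in Python 3.11+

verbose_pattern = r"""[0-9] # 1 Number
[A-Z] # 1 Uppercase Letter
[a-y] # 1 lowercase, but not z
z     # gotta have z...
"""
print(sre_parse.parse(verbose_pattern, re.VERBOSE))
# [(IN, [(RANGE, (48, 57))]), (IN, [(RANGE, (65, 90))]), (IN, [(RANGE, (97, 121))]), (LITERAL, 122)]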
In order to do a proper job of converting your verbose regex to "[0-9][A-Z][a-y]z" you could parse it yourself. You could do this with a library like pyparsing. The other answer linked in your question uses regex, which will generally not duplicate the behavior correctly (specifically, spaces inside character classes and # inside capture groups/character classes. And even just dealing with escaping is not as convenient as with a good parser.)