String manipulation in Regex with seemingly too many exceptions - python

In this post I am calling strings of the form (x1, ..., xn) a sequence and strings of the form {y1, ..., yn} a set, where each xi and yi can be either a number [0-9]+, a word [a-zA-Z]+, a mixed number/word [a-zA-Z0-9]+, a sequence, or a set.
I want to know if it is at all possible (and if so, help figuring out how) to use Regex to deal with the following:
I want to transform sequences of the form (x1, ..., xn, xn+1) into ((x1, ..., xn), xn+1).
Examples:
(1,2,3,4,5) will change to ((1,2,3,4),5)
((1,2,3,4),5) will change to (((1,2,3),4),5) since the only string of the form (x1, ..., xn, xn+1) is the (1,2,3,4) on the inside.
(((1,2,3),4),5) will change to ((((1,2),3),4),5) since the only string of the form (x1, ..., xn, xn+1) is the (1,2,3) on the inside.
(1,(2,3),4) will change to ((1,(2,3)),4)
({1},{2},{3}) will change to (({1},{2}),{3})
As per request, more examples:
((1,2),(3,4),5) will change to (((1,2),(3,4)),5)
((1,2),3,(4,5)) will change to (((1,2),3),(4,5))
(1,(2,(3,4),5)) will change to (1,((2,(3,4)),5)) since the only sequence of the form (x1, ..., xn, xn+1) is the (2,(3,4),5) on the inside.
Here is what I have so far:
re.sub(r'([(][{}a-zA-Z0-9,()]+,[{}a-zA-Z0-9,]+),', r'(\g<1>),', string)
You can see the strings it works for here.
It is not working for (1,(2,3),4), and it is working on things like ({{x},{x,y}},z) when it shouldn't be. Any help is greatly appreciated; I feel like it is possible to get this to work in Regex but there seem to be a lot of special cases that require the Regex to be very precise.

For any recursion in Python, you'll have to use the PyPI regex module (as mentioned by anubhava in his answer). You can use the following pattern:
See regex in use here
(\((({(?:[^{}]+|(?-1))+})|\((?:[^(),]+,[^(),]+)\)|[^(){},]+)(?:,(?2))+),
Replace with (\1),
How it works:
(\((({(?:[^{}]+|(?-1))+})|\((?:[^(),]+,[^(),]+)\)|[^(){},]+)(?:,(?2))+), capture the following into capture group 1, then match ,
\( match ( literally
(({(?:[^{}]+|(?-1))+})|\((?:[^(),]+,[^(),]+)\)|[^(){},]+) match one of the following options one or more times (and capture it into capture group 2)
({(?:[^{}]+|(?-1))+}) capture the following into capture group 3
{(?:[^{}]+|(?-1))+} matches {, then one of the following options one or more times, then }
[^{}]+ matches any character except { or } one or more times
(?-1) recurses the previous capture group (capture group 3)
\((?:[^(),]+,[^(),]+)\) matches (, then the following, then )
[^(),]+ matches any character except (, ), or , one or more times
, matches the comma , character literally
[^(),]+ matches any character except (, ), or , one or more times
[^(){},]+ matches any character except (, ), {, }, or , one or more times
(?:,(?2))+ matches the following one or more times
,(?2) matches ,, then recurses capture group 2
In simpler terms, capture group 2 defines what a term is. It...:
Matches any set {y1, ..., yn}, recursively: {{y1, ..., yn}, ..., yn}
Matches any complete sequence of exactly two elements: (x1, x2)
Matches any string object (numbers, words, etc.): 1, 2, ... x, y, ...
Then capture group 1 uses the well-defined terms from capture group 2 to match as many terms as possible, with the string containing at least two terms and a comma (x,x, with as many x, as possible). The replacement takes this capture group, encases it in () and appends ,. So in the case of (x,x,x,x), we get ((x,x,x),x).
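If pulling in the PyPI regex module isn't an option, the same transformation can be done without regular expressions at all, using a small stdlib-only parser. This is just a sketch (parse, render, step and left_nest are names I made up); it rewrites every eligible sequence in one pass, much as a global substitution would:

```python
import re

# Tokens are brackets, commas, and number/word atoms.
TOKEN = re.compile(r"[(){},]|[a-zA-Z0-9]+")

def parse(s):
    """Parse into a tree: atoms are strings, containers are ('seq'|'set', items)."""
    tokens = TOKEN.findall(s)
    pos = 0
    def element():
        nonlocal pos
        tok = tokens[pos]
        if tok in "({":
            close = ")" if tok == "(" else "}"
            kind = "seq" if tok == "(" else "set"
            pos += 1
            items = [element()]
            while tokens[pos] == ",":
                pos += 1
                items.append(element())
            assert tokens[pos] == close, "unbalanced brackets"
            pos += 1
            return (kind, items)
        pos += 1
        return tok
    return element()

def render(node):
    """Turn the tree back into a string."""
    if isinstance(node, str):
        return node
    kind, items = node
    left, right = ("(", ")") if kind == "seq" else ("{", "}")
    return left + ",".join(render(i) for i in items) + right

def step(node):
    """Rewrite every (x1, ..., xn, xn+1) with n >= 2 as ((x1, ..., xn), xn+1)."""
    if isinstance(node, str):
        return node
    kind, items = node
    items = [step(i) for i in items]
    if kind == "seq" and len(items) >= 3:
        return ("seq", [("seq", items[:-1]), items[-1]])
    return (kind, items)

def left_nest(s):
    return render(step(parse(s)))
```

For instance, left_nest('(1,2,3,4,5)') returns '((1,2,3,4),5)' and left_nest('(1,(2,(3,4),5))') returns '(1,((2,(3,4)),5))', matching the examples above.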
Edit
By making the non-capture group possessive (?:[^{}]+|(?-1))++ (prevents backtracking) and changing the order of the options (most prevalent first), we can improve the efficiency of the pattern (764 -> 662 steps):
See regex in use here
(\(([^(){},]+|\((?:[^(),]+,[^(),]+)\)|({(?:[^{}]+|(?-1))++}))(?:,(?2))+),

If you can consider using the PyPI regex module for Python, which supports PCRE features, then this is possible with recursive matching support, using this regex:
/
( # start capture group #1
\( # match left (
(?<el> # start named group el
( { (?: [^{}]*+ | (?-1) )* } ) | # Match {...} text OR
( \( (?: [^()]*+ | (?-1) )*+ \) ) | # Match (...) text OR
\w+ # Match 1+ word characters
) # End named group el
(?: , (?&el) )+ # Match comma followed by recursion of 'el'
# Match this group 1+ times
) , # End capture group #1
/ x # Enable extended mode in regex
RegEx Demo

Here's a simple parser built with ply
It is certainly less compact than the regex solutions, but it has a couple of considerable advantages:
It's a lot easier to write and understand. (In fact, my first attempt worked perfectly except for a typo in one of the names.) Moreover, it is reasonably clear by examination exactly what syntax is being parsed. (This assumes some minimal understanding of generative grammars, of course, but the concepts are not particularly difficult and there are many available learning resources.)
If you want to add more features in the future, it's straightforward to modify. If instead of just reformatting the text, you want to actually make some use of the decomposed structure, that is easily available without much effort.
As with most generated parsers, it has two components: a lexer (or scanner), which decomposes the input into tokens and discards unimportant text such as whitespace, and a parser which analyses the stream of tokens in order to figure out its structure. Normally a parser would construct some kind of data structure representing the input, typically some kind of tree. Here I've simplified the process by just recombining the parsed input into a transformed output. (In retrospect, I can't help thinking that, as usual, it would have been clearer to produce an entire parse tree and then create the output by doing a walk over the tree. Perhaps as penance I'll redo it later.)
Here's the scanner. The only meaningful tokens are the punctuation symbols and what I've called WORDs, which are sequences of whatever Python considers word characters (usually alphabetic and numeric characters plus underscores), without distinguishing between purely alphabetic, purely numeric, and mixed tokens as in your question.
import ply.lex as lex

tokens = [ "WORD" ]
t_WORD = r"\w+"

# Punctuation
literals = "{}(),"

# Ignore whitespace
t_ignore = " \r\n\t"

# Anything else is an error
def t_error(t):
    print("Illegal character %s" % repr(t.value[0]))
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()
Now the parser. The grammar for sequences is a little redundant because it has to special-case a sequence of one item: since the grammar explicitly inserts parentheses around A,B as it parses, it would be incorrect to add them around the entire sequence, but if the entire sequence is one item, the original parentheses have to be reinserted. For sets, things are much simpler: the elements are not modified at all, and the braces are simply added back.
Here's the entire grammar:
# scalar : WORD | set | sequence
# sequence : '(' scalar ')'
# | '(' seqlist ')'
# seqlist : scalar ',' scalar
# | seqlist ',' scalar
# set : '{' setlist '}'
# setlist : scalar
# | setlist ',' scalar
And here's the implementation, with the grammar repeated, Ply-style, as docstrings:
import ply.yacc as yacc
start = 'scalar'
def p_unit(p):
    """scalar : WORD
              | set
              | sequence
       setlist : scalar
    """
    p[0] = p[1]

def p_sequence_1(p):
    """sequence : '(' scalar ')'
    """
    p[0] = '(%s)' % p[2]

def p_sequence(p):
    """sequence : '(' seqlist ')'
    """
    p[0] = p[2]

def p_seqlist(p):
    """seqlist : scalar ',' scalar
               | seqlist ',' scalar
    """
    p[0] = "(%s,%s)" % (p[1], p[3])

def p_set(p):
    """set : '{' setlist '}'
    """
    p[0] = '{%s}' % p[2]

def p_setlist(p):
    """setlist : setlist ',' scalar
    """
    p[0] = "%s,%s" % (p[1], p[3])

def p_error(p):
    if p:
        print("Syntax error at token", p.type)
    else:
        print("Syntax error at EOF")

parser = yacc.yacc()
Now, a (very) simple driver:
import readline
while True:
    try:
        s = input('> ')
    except EOFError:
        break
    if s:
        print(parser.parse(s, lexer=lexer))

Related

Python - string replace between parenthesis with wildcards

I am trying to remove some text from a string. What I want to remove could be any of the examples listed below. Basically any combination of uppercase and lowercase, any combination of integers at the end, and any combination of letters at the end. There could also be a space between or not.
(Disk 1)
(Disk 5)
(Disc2)
(Disk 10)
(Part A)
(Pt B)
(disk a)
(CD 7)
(cD X)
I have a method already to get the beginning "(type"
multi_disk_search = [ '(disk', '(disc', '(part', '(pt', '(prt' ]
if any(mds in fileName.lower() for mds in multi_disk_search): #https://stackoverflow.com/a/3389611
    for mds in multi_disk_search:
        if mds in fileName.lower():
            print(mds)
            break
That returns (disc for example.
I cannot just split by the parentheses because there could be other tags in other parentheses. Also there is no specific order to the tags. The one I am searching for is typically last; however, many times it is not.
I think the solution will require regex, but I'm really lost when it comes to that.
I tried this, but it returns something that doesn't make any sense to me.
regex = re.compile(r"\s*\%s\s*" % (mds), flags=re.I) #https://stackoverflow.com/a/20782251/11214013
regex.split(fileName)
newName = regex
print(newName)
Which returns re.compile('\\s*\\(disc\\s*', re.IGNORECASE)
What are some ways to solve this?
Perhaps something like this:
rx = re.compile(r'''
\(
(?: dis[ck] | p(?:a?r)?t )
[ ]?
(?: [a-z]+ | [0-9]+ )
\)''', re.I | re.X)
This pattern uses only basic regex syntax, except perhaps the X flag, i.e. verbose mode (with it, any blank character in the pattern is ignored unless it is escaped or inside a character class). Feel free to read the Python manual about the re module. Adding support for CD is left as an exercise.
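For instance, stripping such a tag from some made-up filenames (this is just a sketch; tidying up any leftover spaces is a separate step):

```python
import re

# Same pattern as above: "(disk 1)", "(Disc2)", "(Pt B)", etc.
rx = re.compile(r'''
    \(
    (?: dis[ck] | p(?:a?r)?t )
    [ ]?
    (?: [a-z]+ | [0-9]+ )
    \)''', re.I | re.X)

for name in ['Movie (Disk 1).mkv', 'Movie (Disc2) (1999).mkv', '(Pt B) Movie.mkv']:
    # Remove the multi-disk tag; other parenthesized tags like (1999) survive.
    print(rx.sub('', name))
```

Note that '(CD 7)' is left untouched until the CD alternative is added to the pattern.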
>>> import re
>>> def remove_parens(s,multi_disk_search):
... mds = '|'.join([re.escape(x) for x in multi_disk_search])
... return re.sub(rf'\((?:{mds})[0-9A-Za-z ]*\)','',s,0,re.I)
...
>>> multi_disk_search = ['disk','cd','disc','part','pt']
>>> remove_parens('this is a (disc a) string with (123xyz) parens removed',multi_disk_search)
'this is a string with (123xyz) parens removed'

Avoid special values or space between values using python re

For any phone number which allows () in the area code and any space between area code and the 4th number, I want to create a tuple of the 3 sets of numbers.
For example: (301) 556-9018 or (301)556-9018 would return ('301','556','9018').
I will raise a Value error exception if the input is anything other than the original format.
How do I avoid () characters and include either \s or none between the area code and the next values?
This is my foundation so far:
phonenum=re.compile('''([\d)]+)\s([\d]+) - ([\d]+)$''',re.VERBOSE).match('(123) 324244-123').groups()
print(phonenum)
Do I need to make an if-then statement to ignore the () for the first tuple element, or is there a re expression that does that more efficiently?
In addition the \s in between the first 2 tuples doesn't work if it's (301)556-9018.
Any hints on how to approach this?
When specifying a regular expression, you should use raw-string mode:
`r'abc'` instead of `'abc'`
That said, right now you are capturing three sets of numbers in groups. To allow parens, you will need to match parens. (The parens you currently have are for the capturing groups.)
You can match parens by escaping them: \( and \)
You can find various solutions to "what is a regex for XXX" by searching one of the many "regex library" web sites. I was able to find this one via DuckDuckGo: http://www.regexlib.com/Search.aspx?k=phone
To make a part of your pattern optional, you can make the individual pieces optional, or you can provide alternatives with the piece present or absent.
Since the parens have to be present or absent together - that is, you don't want to allow an opening paren but no closing paren - you probably want to provide alternatives:
# number, no parens: 800 555-1212
noparens = r'\d{3}\s+\d{3}-\d{4}'
# number with parens: (800) 555-1212
yesparens = r'\(\d{3}\)\s*\d{3}-\d{4}'
You can match the three pieces by inserting "grouping parens":
noparens_grouped = r'(\d{3})\s+(\d{3})-(\d{4})'
yesparens_grouped = r'\((\d{3})\)\s*(\d{3})-(\d{4})'
Note that the quoted parens go outside of the grouping parens, so that the parens do not become part of the captured group.
You can join the alternatives together with the | operator:
yes_or_no_parens_groups = noparens_grouped + '|' + yesparens_grouped
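Putting the pieces together might look like this (parse_phone is a hypothetical helper; whichever alternative matches fills its three groups, and the other alternative's groups stay None, so they are filtered out):

```python
import re

noparens_grouped = r'(\d{3})\s+(\d{3})-(\d{4})'
yesparens_grouped = r'\((\d{3})\)\s*(\d{3})-(\d{4})'
yes_or_no_parens_groups = noparens_grouped + '|' + yesparens_grouped

def parse_phone(s):
    # fullmatch anchors the whole string against either alternative.
    m = re.fullmatch(yes_or_no_parens_groups, s)
    if m is None:
        raise ValueError('not a valid phone number: %r' % s)
    # Drop the None groups from the alternative that did not match.
    return tuple(g for g in m.groups() if g is not None)
```

Both parse_phone('(301) 556-9018') and parse_phone('(301)556-9018') return ('301', '556', '9018').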
In regular expressions you can use special characters to specify some behavior of some part of the expression.
From python re documentation:
'*' =
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
'+' =
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
'?' =
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
So to solve the blank space problem you can use either '?' if you know the occurrence will be no more than 1, or '+' if you can have more than 1.
In case of grouping information together and them returning a list, you can put your expression inside parenthesis and then use function groups() from re.
The result would be:
results = re.search(r'\((\d{3})\)\s?(\d{3})-(\d{4})', '(301) 556-9018')
if results:
    print(results.groups())
else:
    print('Invalid phone number')

Splitting a string with delimiters and conditions

I'm trying to split a general string of chemical reactions delimited by whitespace, +, = where there may be an arbitrary number of whitespaces. This is the general case but I also need it to split conditionally on the parentheses characters () when there is a + found inside the ().
For example:
reaction= 'C5H6 + O = NC4H5 + CO + H'
Should be split such that the result is
splitresult=['C5H6','O','NC4H5','CO','H']
This case seems simple when using filter(None,re.split('[\s+=]',reaction)). But now comes the conditional splitting. Some reactions will have a (+M) which I'd also like to split off of as well leaving only the M. In this case, there will always be a +M inside the parentheses
For example:
reaction='C5H5 + H (+M)= C5H6 (+M)'
splitresult=['C5H5','H','M','C5H6','M']
However, there will be some cases where the parentheses will not be delimiters. In these cases, there will not be a +M but something else that doesn't matter.
For example:
reaction='C5H5 + HO2 = C5H5O(2,4) + OH'
splitresult=['C5H5','HO2','C5H5O(2,4)','OH']
My best guess is to use negative lookahead and lookbehind to match the +M but I'm not sure how to incorporate that into the regex expression I used above on the simple case. My intuition is to use something like filter(None,re.split('[(?<=M)\)\((?=\+)=+\s]',reaction)). Any help is much appreciated.
You could use re.findall() instead:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
then:
import re
reaction0= 'C5H6 + O = NC4H5 + CO + H'
reaction1='C5H5 + H (+M)= C5H6 (+M)'
reaction2='C5H5 + HO2 = C5H5O(2,4) + OH'
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction0)
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction1)
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction2)
but, if you prefer re.split() and filter(), then:
import re
reaction0= 'C5H6 + O = NC4H5 + CO + H'
reaction1='C5H5 + H (+M)= C5H6 (+M)'
reaction2='C5H5 + HO2 = C5H5O(2,4) + OH'
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction0))
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction1))
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction2))
The pattern for findall is different from the pattern for split, because findall and split are looking for different things; 'the opposite things', indeed:
findall looks for what you want to keep.
split looks for what you want to get rid of.
In findall, '[A-Z0-9]+(?:\([1-9],[1-9]\))?' matches any uppercase letter or digit > [A-Z0-9], one or more times > +, followed by a pair of digits with a comma in the middle, inside parentheses > \([1-9],[1-9]\) (literal parentheses outside of character classes must be escaped with backslashes '\'), optionally > ?.
The \([1-9],[1-9]\) is inside (?: ) followed by ?, which makes it optional; a plain ( ) instead of (?: ) would also work, but in this case (?: ) is better; (?: ) is a non-capturing group: read about this.
Try the same reasoning with the regex in the split.
That seems overly complicated to handle with a single regular expression to split the string. It'd be much easier to handle the special case of (+M) separately:
halfway = re.sub(r"\(\+M\)", "M", reaction)
result = filter(None, re.split(r'[\s+=]', halfway))
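That two-step idea can be wrapped up as a runnable sketch (split_reaction is a hypothetical name; a list comprehension replaces filter so it behaves the same on Python 2 and 3):

```python
import re

def split_reaction(reaction):
    # Rewrite the special third-body token "(+M)" to plain "M" first...
    halfway = re.sub(r'\(\+M\)', 'M', reaction)
    # ...then split on whitespace, "+" and "=", dropping empty pieces.
    return [tok for tok in re.split(r'[\s+=]', halfway) if tok]
```

For example, split_reaction('C5H5 + H (+M)= C5H6 (+M)') returns ['C5H5', 'H', 'M', 'C5H6', 'M'], while 'C5H5O(2,4)' in the third example survives intact because neither parentheses nor commas are split characters.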
So here is the regex which you are looking for.
Regex: ((?=\(\+)\()|[\s+=]|((?<=M)\))
Flags used:
g for global search, or adjust the flags to your situation (Python's re.sub and re.findall already process the whole string).
Explanation:
((?=\(\+)\() checks for a ( which is followed by +. This covers the first part of your (+M) problem.
((?<=M)\)) checks for a ) which is preceded by M. This covers the second part of your (+M) problem.
[\s+=] checks for all the remaining whitespaces, + and =. This covers the last part of your problem.
Note: the digits enclosed by () (as in (2,4)) are taken care of by the positive lookahead and positive lookbehind assertions.
Check the Regex101 demo to see it working.
P.S.: Adapt it to your needs, as I am not a Python programmer yet.

Is this possible using regular expression

I am using Python 2.7 and I am fairly familiar with using regular expressions and how to use them in Python. I would like to use a regex to replace comma delimiters with a semicolon. The problem is that data wrapped in double qoutes should retain embedded commas. Here is an example:
Before:
"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After:
"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
Is there a single regex that can do this?
This is another way, one that avoids testing the whole string to the end with a lookahead at each occurrence. It's a kind of (more or less) \G feature emulation for the re module.
Instead of testing what comes after the comma, this pattern finds the item before the comma (and the comma, obviously) and is written in a way that makes each whole match consecutive to the previous one.
re.sub(r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,', r'\1;', s)
online demo
details:
(?:                               # ensures that results are contiguous
    (?<=,)                        # preceded by a comma (the one of the previous match)
  |                               # OR
    ^                             # at the start of the string
)
(?=                               # (?=(a+))\1 is a way to emulate an atomic group: (?>a+)
    (                             # capture the item before the comma in group 1
        "(?:"")*(?:[^"]+(?:"")*)*"    # an item between quotes
      |
        [^",]*                    # an item without quotes
    )
)
\1                                # back-reference to capture group 1
,                                 # the comma itself
The advantage of this approach is that it reduces the number of steps needed to obtain a match, and the step count stays nearly constant whatever the item before the comma (see the regex101 debugger). The reason is that all characters are matched/tested only once. So even though the pattern is longer, it is more efficient (and the gain grows with long lines).
The atomic group trick is only here to reduce the number of steps before failing for the last item (that is not followed by a comma).
Note that the pattern deals with items between quotes with escaped quotes (two consecutive quotes) inside: "abcd""efgh""ijkl","123""456""789",foo
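A quick check of the pattern against that escaped-quotes example (re.sub resumes scanning the original string after each match, so the (?<=,) lookbehind sees the original comma):

```python
import re

# The \G-emulation pattern from above.
pattern = r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,'

s = '"abcd""efgh""ijkl","123""456""789",foo'
# Each item (quoted, possibly with doubled quotes, or unquoted) plus its
# trailing comma is rewritten with a semicolon; the last item is untouched.
print(re.sub(pattern, r'\1;', s))
```

This prints "abcd""efgh""ijkl";"123""456""789";foo, leaving the doubled quotes inside the fields intact.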
# Python 2.7
import re
text = '''
"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
'''.strip()
print "Before: " + text
print "After: " + ";".join(re.findall(r'(?:"[^"]+"|[^,]+)', text))
This produces the following output:
Before: "3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After: "3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
You can tinker with this here if you need more customization.
You can use:
>>> s = 'foo bar,"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
>>> print re.sub(r'(?=(([^"]*"){2})*[^"]*$),', ';', s)
foo bar;"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
RegEx Demo
This will match a comma only if it is outside quotes, by requiring an even number of quotes after the ,.
This regex seems to do the job
,(?=(?:[^"]*"[^"]*")*[^"]*\Z)
Adapted from:
How to match something with regex that is not between two special characters?
And tested with http://pythex.org/
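In Python, that adapted pattern can be applied with re.sub (using the example string from the question):

```python
import re

s = '"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
# A comma is a delimiter only if an even number of quotes follows it,
# i.e. the comma is not inside a quoted field.
result = re.sub(r',(?=(?:[^"]*"[^"]*")*[^"]*\Z)', ';', s)
print(result)
```

This prints "3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25".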
You can split with regex and then join it :
>>> ';'.join([i.strip(',') for i in re.split(r'(,?"[^"]*",?)?',s) if i])
'"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"'

Regex for finding valid sphinx fields

I'm trying to validate that the fields given to sphinx are valid, but I'm having difficulty.
Imagine that valid fields are cat, mouse, dog, puppy.
Valid searches would then be:
#cat search terms
#(cat) search terms
#(cat, dog) search term
#cat searchterm1 #dog searchterm2
#(cat, dog) searchterm1 #mouse searchterm2
So, I want to use a regular expression to find terms such as cat, dog, mouse in the above examples, and check them against a list of valid terms.
Thus, a query such as:
#(goat)
Would produce an error because goat is not a valid term.
I've gotten so that I can find simple queries such as #cat with this regex: (?:#)([^( ]*)
But I can't figure out how to find the rest.
I'm using python & django, for what that's worth.
To match all allowed fields, the following rather fearsome-looking regex works:
#((?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))
It returns these matches, in order: #cat, #(cat), #(cat, dog), #cat, #dog, #(cat, dog), #mouse.
The regex breaks down as follows:
# # the literal character "#"
( # match group 1
(?:cat|mouse|dog|puppy) # one of your valid search terms (not captured)
\b # a word boundary
| # or...
\( # a literal opening paren
(?: # non-capturing group
(?:cat|mouse|dog|puppy) # one of your valid search terms (not captured)
(?: # non-capturing group
, * # a comma "," plus any number of spaces
| # or...
(?=\)) # a position followed by a closing paren
) # end non-capture group
)+ # end non-capture group, repeat
\) # a literal closing paren
) # end match group one.
Now to identify any invalid search, you would wrap all that in a negative look-ahead:
#(?!(?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))
--^^
This would identify any # character after which an invalid search term (or term combination) was attempted. Modifying it so that it also matches the invalid attempt instead of just pointing at it is not that hard anymore.
You would have to prepare (?:cat|mouse|dog|puppy) from your field dynamically and plug it into the static rest of the regex. Should not be too hard to do either.
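Building the pattern dynamically from a field list might look like this (invalid_field_re is a hypothetical helper; re.escape guards against metacharacters in field names):

```python
import re

def invalid_field_re(fields):
    # Alternation of the valid field names, e.g. cat|mouse|dog|puppy.
    alt = '|'.join(re.escape(f) for f in fields)
    # "#" not followed by a valid bare field or a valid (a, b, ...) list.
    return re.compile(r'#(?!(?:%s)\b|\((?:(?:%s)(?:, *|(?=\))))+\))' % (alt, alt))

checker = invalid_field_re(['cat', 'mouse', 'dog', 'puppy'])
```

checker.search('#(goat) terms') finds the offending #, while checker.search('#(cat, dog) terms') returns None.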
This pyparsing solution follows a similar logic path as your posted answer. All tags are matched, and then checked against the list of known valid tags, removing them from the reported results. Only those matches that have values left over after removing the valid ones are reported as matches.
from pyparsing import *
# define the pattern of a tag, setting internal results names for easy validation
AT,LPAR,RPAR = map(Suppress,"#()")
term = Word(alphas,alphanums).setResultsName("terms",listAllMatches=True)
sphxTerm = AT + ~White() + ( term | LPAR + delimitedList(term) + RPAR )
# define tags we consider to be valid
valid = set("cat mouse dog".split())
# define a parse action to filter out valid terms, and attach to the sphxTerm
def filterValid(tokens):
    tokens = [t for t in tokens.terms if t not in valid]
    if not tokens:
        raise ParseException("",0,"")
    return tokens
sphxTerm.setParseAction(filterValid)
##### Test out the parser #####
test = """#cat search terms # house
#(cat) search terms
#(cat, dog) search term #(goat)
#cat searchterm1 #dog searchterm2 #(cat, doggerel)
#(cat, dog) searchterm1 #mouse searchterm2
#caterpillar"""
# scan for invalid terms, and print out the terms and their locations
for t,s,e in sphxTerm.scanString(test):
    print "Terms:%s Line: %d Col: %d" % (t, lineno(s, test), col(s, test))
    print line(s, test)
    print " "*(col(s,test)-1)+"^"
    print
With these lovely results:
Terms:['goat'] Line: 3 Col: 29
#(cat, dog) search term #(goat)
^
Terms:['doggerel'] Line: 4 Col: 39
#cat searchterm1 #dog searchterm2 #(cat, doggerel)
^
Terms:['caterpillar'] Line: 6 Col: 5
#caterpillar
^
This last snippet will do all the scanning for you, and just give you the list of found invalid tags:
# print out all of the found invalid terms
print list(set(sum(sphxTerm.searchString(test), ParseResults([]))))
Prints:
['caterpillar', 'goat', 'doggerel']
This should work:
#\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)|#(cat|dog|mouse|puppy)\b
It will either match a single #parameter or a parenthesized #(par1, par2) list containing only allowed words (one or more).
It also makes sure that no partial matches are accepted (#caterpillar).
Try this:
field_re = re.compile(r"#(?:([^()\s]+)|\(([^()]+)\))")
A single field name (like cat in #cat) will be captured in group #1, while the names in a parenthesized list like #(cat, dog) will be stored in group #2. In the latter case you'll need to break the list down with split() or something; there's no way to capture the names individually with a Python regex.
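A sketch of that splitting step (found_fields is a hypothetical helper; note the pattern needs a capture group around the parenthesized body so findall returns the list case as group #2):

```python
import re

field_re = re.compile(r"#(?:([^()\s]+)|\(([^()]+)\))")

def found_fields(query):
    fields = []
    # findall returns (group1, group2) tuples; exactly one is non-empty.
    for single, grouped in field_re.findall(query):
        if single:
            fields.append(single)
        else:
            # Break the parenthesized list down, as suggested above.
            fields.extend(name.strip() for name in grouped.split(','))
    return fields
```

For example, found_fields('#cat term1 #(cat, dog) term2') returns ['cat', 'cat', 'dog'].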
This will match all fields that are cat, dog, mouse, or puppy and combinations thereof.
import re
sphinx_term = "#goat some words to search"
regex = re.compile(r"#\(?(cat|dog|mouse|puppy)(, ?(cat|dog|mouse|puppy))*\)? ")
if regex.search(sphinx_term):
    # send the query to sphinx...
I ended up doing this a different way, since none of the above worked. First I found the fields like #cat, with this:
attributes = re.findall('(?:#)([^\( ]*)', query)
Next, I found the more complicated ones, with this:
regex0 = re.compile(r'''
    \#              # at sign (escaped, since a bare "#" starts a comment in verbose mode)
    (?:             # start non-capturing group
        \w+         # word characters, one or more
        \b          # a boundary character (i.e. no more \w)
    |               # OR
        (           # capturing group
            \(      # left paren
            [^#(),]+    # not an #(),
            (?:     # another non-capturing group
                , *     # a comma, then some spaces
                [^#(),]+    # not #(),
            )*      # some quantity of this non-capturing group
            \)      # a right paren
        )           # end of capturing group
    )               # end of non-capturing group
    ''', re.VERBOSE)
# and this puts them into the attributes list.
groupedAttributes = re.findall(regex0, query)
for item in groupedAttributes:
    attributes.extend(item.strip("(").strip(")").split(", "))
Next, I checked if the attributes I found were valid, and added them (uniquely to an array):
# check if the values are valid.
validRegex = re.compile(r'^mice$|^mouse$|^cat$|^dog$')
# if they aren't add them to a new list.
badAttrs = []
for attribute in attributes:
    if len(attribute) == 0:
        # if it's a zero length attribute, we punt
        continue
    if validRegex.search(attribute.lower()) == None:
        # if the attribute from the search isn't in the valid list
        if attribute not in badAttrs:
            # and the attribute isn't already in the list
            badAttrs.append(attribute)
Thanks all for the help though. I'm very glad to have had it!
