Regex for finding valid sphinx fields - python

I'm trying to validate that the fields given to sphinx are valid, but I'm having difficulty.
Imagine that valid fields are cat, mouse, dog, puppy.
Valid searches would then be:
#cat search terms
#(cat) search terms
#(cat, dog) search term
#cat searchterm1 #dog searchterm2
#(cat, dog) searchterm1 #mouse searchterm2
So, I want to use a regular expression to find terms such as cat, dog, mouse in the above examples, and check them against a list of valid terms.
Thus, a query such as:
#(goat)
Would produce an error because goat is not a valid term.
I've gotten to the point where I can find simple queries such as #cat with this regex: (?:#)([^( ]*)
But I can't figure out how to find the rest.
I'm using python & django, for what that's worth.

To match all allowed fields, the following rather fearful looking regex works:
#((?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))
It returns these matches, in order: #cat, #(cat), #(cat, dog), #cat, #dog, #(cat, dog), #mouse.
The regex breaks down as follows:
#                             # the literal character "#"
(                             # match group 1
  (?:cat|mouse|dog|puppy)     # one of your valid search terms (not captured)
  \b                          # a word boundary
  |                           # or...
  \(                          # a literal opening paren
  (?:                         # non-capturing group
    (?:cat|mouse|dog|puppy)   # one of your valid search terms (not captured)
    (?:                       # non-capturing group
      , *                     # a comma "," plus any number of spaces
      |                       # or...
      (?=\))                  # a position followed by a closing paren
    )                         # end non-capturing group
  )+                          # end non-capturing group, repeat
  \)                          # a literal closing paren
)                             # end match group one.
Now to identify any invalid search, you would wrap all that in a negative look-ahead:
#(?!(?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))
--^^
This would identify any # character after which an invalid search term (or term combination) was attempted. Modifying it so that it also matches the invalid attempt instead of just pointing at it is not that hard anymore.
You would have to prepare the (?:cat|mouse|dog|puppy) alternation from your list of valid fields dynamically and plug it into the static rest of the regex. That should not be too hard to do either.
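For example, a rough sketch of how that dynamic assembly might look (the field list and variable names below are only illustrative):
import re

valid_fields = ["cat", "mouse", "dog", "puppy"]
alt = "|".join(re.escape(f) for f in valid_fields)   # "cat|mouse|dog|puppy"

# plug the alternation into the otherwise static pattern from above
field_re = re.compile(r"#((?:" + alt + r")\b|\((?:(?:" + alt + r")(?:, *|(?=\))))+\))")

query = "#(cat, dog) searchterm1 #mouse searchterm2"
print(field_re.findall(query))   # ['(cat, dog)', 'mouse']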

This pyparsing solution follows a similar logic path as your posted answer. All tags are matched, and then checked against the list of known valid tags, removing them from the reported results. Only those matches that have values left over after removing the valid ones are reported as matches.
from pyparsing import *
# define the pattern of a tag, setting internal results names for easy validation
AT,LPAR,RPAR = map(Suppress,"#()")
term = Word(alphas,alphanums).setResultsName("terms",listAllMatches=True)
sphxTerm = AT + ~White() + ( term | LPAR + delimitedList(term) + RPAR )
# define tags we consider to be valid
valid = set("cat mouse dog".split())
# define a parse action to filter out valid terms, and attach to the sphxTerm
def filterValid(tokens):
    tokens = [t for t in tokens.terms if t not in valid]
    if not tokens:
        raise ParseException("", 0, "")
    return tokens
sphxTerm.setParseAction(filterValid)
##### Test out the parser #####
test = """#cat search terms # house
#(cat) search terms
#(cat, dog) search term #(goat)
#cat searchterm1 #dog searchterm2 #(cat, doggerel)
#(cat, dog) searchterm1 #mouse searchterm2
#caterpillar"""
# scan for invalid terms, and print out the terms and their locations
for t, s, e in sphxTerm.scanString(test):
    print("Terms:%s Line: %d Col: %d" % (t, lineno(s, test), col(s, test)))
    print(line(s, test))
    print(" " * (col(s, test) - 1) + "^")
    print()
With these lovely results:
Terms:['goat'] Line: 3 Col: 29
#(cat, dog) search term #(goat)
^
Terms:['doggerel'] Line: 4 Col: 39
#cat searchterm1 #dog searchterm2 #(cat, doggerel)
^
Terms:['caterpillar'] Line: 6 Col: 5
#caterpillar
^
This last snippet will do all the scanning for you, and just give you the list of found invalid tags:
# print out all of the found invalid terms
print(list(set(sum(sphxTerm.searchString(test), ParseResults([])))))
Prints:
['caterpillar', 'goat', 'doggerel']

This should work:
#\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)|#(cat|dog|mouse|puppy)\b
It will either match a single #parameter or a parenthesized #(par1, par2) list containing only allowed words (one or more).
It also makes sure that no partial matches are accepted (#caterpillar).
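For instance, a quick check of that pattern (a sketch added here, not part of the original answer):
import re

pattern = re.compile(
    r"#\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)|#(cat|dog|mouse|puppy)\b"
)

for q in ("#(cat, dog) search term", "#caterpillar", "#(goat)"):
    print(q, "->", bool(pattern.search(q)))   # True, False, False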

Try this:
field_re = re.compile(r"#(?:([^()\s]+)|\(([^()]+)\))")
A single field name (like cat in #cat) will be captured in group #1, while the names in a parenthesized list like #(cat, dog) will be stored in group #2. In the latter case you'll need to break the list down with split() or something; there's no way to capture the names individually with a Python regex.
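For example, a sketch of how the two groups might be consumed and validated (the valid set and the splitting logic are assumptions for illustration, not part of the answer):
import re

field_re = re.compile(r"#(?:([^()\s]+)|\(([^()]+)\))")
valid = {"cat", "mouse", "dog", "puppy"}

query = "#(cat, dog) searchterm1 #mouse searchterm2 #(goat)"
for m in field_re.finditer(query):
    if m.group(1):                                        # single field, e.g. #mouse
        names = [m.group(1)]
    else:                                                 # parenthesized list, e.g. #(cat, dog)
        names = [n.strip() for n in m.group(2).split(",")]
    bad = [n for n in names if n not in valid]
    if bad:
        print("invalid field(s):", bad)                   # -> invalid field(s): ['goat']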

This will match all fields that are cat, dog, mouse, or puppy and combinations thereof.
import re
sphinx_term = "#goat some words to search"
regex = re.compile(r"#\(?(cat|dog|mouse|puppy)(, ?(cat|dog|mouse|puppy))*\)? ")
if regex.search(sphinx_term):
    pass  # send the query to sphinx...

I ended up doing this a different way, since none of the above worked. First I found the fields like #cat, with this:
attributes = re.findall('(?:#)([^\( ]*)', query)
Next, I found the more complicated ones, with this:
regex0 = re.compile(r'''
    \#            # a literal "#" (escaped, since an unescaped "#" starts a comment under re.VERBOSE)
    (?:           # start non-capturing group
        \w+       # word characters, one or more
        \b        # a word boundary (i.e. no more \w)
        |         # OR
        (         # capturing group
            \(            # a left paren
            [^#(),]+      # one or more characters that are not #, (, ), or comma
            (?:           # another non-capturing group
                ,[ ]*     # a comma, then any number of spaces (bracketed so the space survives re.VERBOSE)
                [^#(),]+  # more characters that are not #, (, ), or comma
            )*            # some quantity of this non-capturing group
            \)            # a right paren
        )         # end of capturing group
    )             # end of non-capturing group
''', re.VERBOSE)
# and this puts them into the attributes list.
groupedAttributes = re.findall(regex0, query)
for item in groupedAttributes:
    attributes.extend(item.strip("(").strip(")").split(", "))
Next, I checked if the attributes I found were valid, and added them (uniquely to an array):
# check if the values are valid.
validRegex = re.compile(r'^mice$|^mouse$|^cat$|^dog$')
# if they aren't, add them to a new list.
badAttrs = []
for attribute in attributes:
    if len(attribute) == 0:
        # if it's a zero-length attribute, we punt
        continue
    if validRegex.search(attribute.lower()) is None:
        # if the attribute from the search isn't in the valid list
        if attribute not in badAttrs:
            # and the attribute isn't already in the list
            badAttrs.append(attribute)
Thanks all for the help though. I'm very glad to have had it!

Related

Select a specific set of characters in a file using regex in python

In my code I have views defined like below.
VIEW Company_Person_Sd IS
Prompt = 'Company'
Company.Prompt = 'Company ID'
SELECT company_id company_id,
emp_no emp_no,
Get_Person(company_id, emp_no) person_id,
cp.rowid objid,
to_char(cp.rowversion) objversion,
rowkey objkey
FROM companies cp;
There can be more than one view defined in a single file (usually there are 20 or more).
I want to get the whole view using a regex in python.
I did the same thing for methods, as shown below, using the following regex (and it worked fine):
methodRegex = r"^\s*((FUNCTION|PROCEDURE)\s+(\w+))(.*?)BEGIN(.*?)^END\s*(\w+);"
methodMatches = re.finditer(methodRegex, fContent, re.DOTALL | re.MULTILINE | re.IGNORECASE | re.VERBOSE)
for methodMatchNum, methodMatch in enumerate(methodMatches, start=1):
    methodContent = methodMatch.group()
    methodNameFull = methodMatch.group(1)
    methodType = methodMatch.group(2)
    methodName = methodMatch.group(3)
method example
PROCEDURE Prepare___ (
attr_ IN OUT VARCHAR2 )
IS
----
BEGIN
--
END Prepare___;
PROCEDURE Insert___ (
attr_ IN OUT VARCHAR2 )
IS
----
BEGIN
--
END Insert___;
When I try to do the same for views, it gives the wrong output.
Actually I couldn't find how to catch the end of the view. I tried with semicolon as well, which gave a wrong output.
My regex for views
viewRegex = r"^\s*(VIEW\s+(\w+))(.*?)SELECT(.*?)^FROM\s*(\w+);"
Please help me find out where I'm doing it wrong. Thanks in advance.
If you have a lot of views in a single file, another option is to avoid using .*? with re.DOTALL, to prevent unnecessary backtracking.
Instead, you can match the parts from VIEW to SELECT to FROM, using a negative lookahead to check that what is in between is not another one of the keywords, so you don't match too much (assuming these cannot occur in between).
For the last part after FROM, you can match word characters, optionally followed by whitespace chars and more word characters.
^(VIEW\s+(\w+))(.*(?:\n(?!SELECT|VIEW|FROM).*)*)\nSELECT\s+(.*(?:\n(?!SELECT|VIEW|FROM).*)*)\nFROM\s+(\w+(?:\s+\w+)?);
The pattern matches:
^ Start of string
(VIEW\s+(\w+)) Capture group for VIEW followed by a group for the word characters
(.*(?:\n(?!SELECT|VIEW|FROM).*)*) Capture group matching the rest of the lines, and all lines that do not start with a keyword
\nSELECT\s+ Match a newline, SELECT and 1+ whitespace chars
(.*(?:\n(?!SELECT|VIEW|FROM).*)*) Capture group matching the rest of the lines, and all lines that do not start with a keyword
\nFROM\s+ Match a newline, FROM and 1+ whitespace chars
(\w+(?:\s+\w+)?); Capture group for the value of FROM, matching 1+ word characters, optionally followed by whitespace chars and more word characters
Regex demo
For example (You can omit the re.VERBOSE and re.DOTALL)
import re
methodRegex = r"^(VIEW\s+(\w+))(.*(?:\n(?!SELECT|VIEW|FROM).*)*)\nSELECT\s+(.*(?:\n(?!SELECT|VIEW|FROM).*)*)\nFROM\s+(\w+(?:\s+\w+)?);"
fContent = ("VIEW Company_Person_Sd IS\n"
            " Prompt = 'Company'\n"
            " Company.Prompt = 'Company ID'\n"
            "SELECT company_id company_id,\n"
            " emp_no emp_no,\n"
            " Get_Person(company_id, emp_no) person_id,\n"
            " cp.rowid objid,\n"
            " to_char(cp.rowversion) objversion,\n"
            " rowkey objkey\n"
            "FROM companies cp;")
methodMatches = re.finditer(methodRegex, fContent, re.MULTILINE | re.IGNORECASE)
for methodMatchNum, methodMatch in enumerate(methodMatches, start=1):
    methodContent = methodMatch.group()
    methodNameFull = methodMatch.group(1)
    methodType = methodMatch.group(2)
    methodName = methodMatch.group(3)
You don't get any match with viewRegex because it only matches when there are only word characters ([a-zA-Z0-9_]) between FROM and ;, whereas your example also includes whitespace. So take whitespace into account as well:
viewRegex = r"^\s*(VIEW\s+(\w+))(.*?)SELECT(.*?)^FROM\s*([\w\s]+);"
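For example, a rough sketch using the same flags as your method code (re.VERBOSE is not needed here; the shortened view text is just for illustration):
import re

viewRegex = r"^\s*(VIEW\s+(\w+))(.*?)SELECT(.*?)^FROM\s*([\w\s]+);"
fContent = """VIEW Company_Person_Sd IS
   Prompt = 'Company'
SELECT company_id company_id,
       emp_no     emp_no
FROM companies cp;"""

for viewMatch in re.finditer(viewRegex, fContent, re.DOTALL | re.MULTILINE | re.IGNORECASE):
    print(viewMatch.group(2))   # prints the view name: Company_Person_Sd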

Building a regular expression to find text near each other

I'm having issue getting this search to work:
import re
word1 = 'this'
word2 = 'that'
sentence = 'this and that'
print(re.search('(?:\b(word1)\b(?: +[^ \n]*){0,5} *\b(word2)\b)|(?:\b(word2)\b(?: +[^ \n]*){0,5} *\b(word1)\b)',sentence))
I need to build a regex search to find if a string has up to 5 different sub-strings in any order within a certain number of other words (so two strings could be 3 words apart, three strings a total of 6 words apart, etc).
I've found a number of similar questions such as Regular expression gets 3 words near each other. How to get their context? or How to check if two words are next to each other in Python?, but none of them quite do this.
So if the search words were 'this', 'that', 'these', and 'those' and they appeared within 9 words of each other in any order, then the script would output True.
It seems like writing an if/else block with all sorts of different regex statements to accommodate the different permutations would be rather cumbersome, so I'm hoping there is a more efficient way to code this in Python.
This can be done using engines that support conditionals, atomic groups,
and capture group status as flagged, marked EMPTY or NULL, where null is undefined.
So this covers almost all modern engines; some, like JS, are incomplete though.
Python can support this using its replacement engine, the regex module (import regex).
Basically, this will support out-of-order matching and can be confined to the shortest
range, from 4 to 9 total words.
The bottom (?= \1 \2 \3 \4 ) asserts that all the required items were found.
Using this without the atomic group might cause backtrack problems, but since it
is there, this regex is very fast.
Update: added a lookahead (?= this | that | these | those ) so it starts the match on a special word.
Python code
>>> import regex
>>>
>>> targ = 'this sdgbsesfrgnh these meat ball those nhwsgfr that sfdng sfgnsefn sfgnndfsng'
>>> pat = r'(?=this|that|these|those)(?>\s*(?:(?(1)(?!))\bthis\b()|(?(2)(?!))\bthat\b()|(?(3)(?!))\bthese\b()|(?(4)(?!))\bthose\b()|(?(5)(?!))\b(.+?)\b|(?(6)(?!))\b(.+?)\b|(?(7)(?!))\b(.+?)\b|(?(8)(?!))\b(.+?)\b|(?(9)(?!))\b(.+?)\b)\s*){4,9}?(?=\1\2\3\4)'
>>>
>>> regex.search(pat, targ).group()
'this sdgbsesfrgnh these meat ball those nhwsgfr that '
General PCRE / Perl et al. (same regex)
(?=this|that|these|those)(?>\s*(?:(?(1)(?!))\bthis\b()|(?(2)(?!))\bthat\b()|(?(3)(?!))\bthese\b()|(?(4)(?!))\bthose\b()|(?(5)(?!))\b(.+?)\b|(?(6)(?!))\b(.+?)\b|(?(7)(?!))\b(.+?)\b|(?(8)(?!))\b(.+?)\b|(?(9)(?!))\b(.+?)\b)\s*){4,9}?(?=\1\2\3\4)
https://regex101.com/r/zhSa64/1
(?= this | that | these | those )
(?>
\s*
(?:
(?(1)(?!))
\b this \b ( ) # (1)
|
(?(2)(?!))
\b that \b ( ) # (2)
|
(?(3)(?!))
\b these \b ( ) # (3)
|
(?(4)(?!))
\b those \b ( ) # (4)
|
(?(5)(?!))
\b ( .+? ) \b # (5)
|
(?(6)(?!))
\b ( .+? ) \b # (6)
|
(?(7)(?!))
\b ( .+? ) \b # (7)
|
(?(8)(?!))
\b ( .+? ) \b # (8)
|
(?(9)(?!))
\b ( .+? ) \b # (9)
)
\s*
){4,9}?
(?= \1 \2 \3 \4 )
ANSWER CHANGED because I found a way to do it with just a regular expression. The approach is to start with a lookahead that requires all target words to be present in the next N words. Then look for a pattern of target words (in any order) separated by 0 or more other words (up to the allowed maximum intermediate words)
The word span (N) is the greatest number of words that would allow all the target words to be at the maximum allowed distance.
For example, if we have 3 target words and we allow a maximum of 4 other words between them, then the maximum word span will be 11: 3 target words plus 2 intermediate runs of at most 4 other words each, 3+4+4=11.
The search pattern is formed by assembling parts that depend on the words and the maximum number of intermediate words allowed.
Pattern : \bALL((ANY)(\W+\w+\W*){0,INTER}){COUNT,COUNT}
breakdown:
\b start on a word boundary
ALL will be substituted by multiple lookaheads that will ensure that every target word is found in the next N words.
each lookahead will have the form (?=(\w+\W*){0,SPAN}WORD\b) where WORD is a target word and SPAN is the number of other words in the longest possible sequence of words. There will be one such lookahead for each of the target words. Thus ensuring that the sequence of N words contains all of target words.
(\b(ANY)(\W+\w+\W*){0,INTER}) matches any target word followed by zero to maxInter intermediate words. In that, ANY will be replaced by a pattern that matches any of the target words (i.e. the words separated by pipes). And INTER will be replaced by the allowed number of intermediate words.
{COUNT,COUNT} ensures that there are as many repetitions of the above as there are target words. This corresponds to the pattern: targetWord+intermediates+targetWord+intermediates...+targetWord
With the lookahead placed before the repeating pattern, we are guaranteed to have all the target words in the sequence of words containing exactly the number of target words with no more intermediate words than is allowed.
...
import re
words = {"this","that","other"}
maxInter = 3 # maximum intermediate words between the target words
wordSpan = len(words)+maxInter*(len(words)-1)
anyWord = "|".join(words)
allWords = "".join(r"(?=(\w+\W*){0,SPAN}WORD\b)".replace("WORD",w)
for w in words)
allWords = allWords.replace("SPAN",str(wordSpan-1))
pattern = r"\bALL(\b(ANY)(\W+\w+\W*){0,INTER}){COUNT,COUNT}"
pattern = pattern.replace("COUNT",str(len(words)))
pattern = pattern.replace("INTER",str(maxInter))
pattern = pattern.replace("ALL",allWords)
pattern = pattern.replace("ANY",anyWord)
textList = [
"looking for this and that and some other thing", # YES
"that rod is longer than this other one", # NO: 4 words apart
"other than this, I have nothing", # NO: missing "that"
"ignore multiple words not before this and that or other", # YES
"this and that or other, followed by a bunch of words", # YES
]
output:
print(pattern)
\b(?=(\w*\b\W+){0,8}this\b)(?=(\w*\b\W+){0,8}other\b)(?=(\w*\b\W+){0,8}that\b)(\b(other|this|that)\b(\w*\b\W+){0,3}){3,3}
for text in textList:
    found = bool(re.search(pattern, text))
    print(found, "\t:", text)
True : looking for this and that and some other thing
False : that rod is longer than this other one
False : other than this, I have nothing
True : ignore multiple words not before this and that or other
True : this and that or other, followed by a bunch of words

String manipulation in Regex with seemingly too many exceptions

In this post I am calling strings of the form (x1, ..., xn) a sequence and strings of the form {y1, ..., yn} a set. Where each xi and yi can be either a number [0-9]+, a word [a-zA-Z]+, a number/word [a-zA-Z0-9]+, a sequence, or a set.
I want to know if it is at all possible (and if so, help figuring out how) to use Regex to deal with the following:
I want to transform sequences of the form (x1, ..., xn, xn+1) into ((x1, ..., xn), xn+1).
Examples:
(1,2,3,4,5) will change to ((1,2,3,4),5)
((1,2,3,4),5) will change to (((1,2,3),4),5) since the only string of the form (x1, ..., xn, xn+1) is the (1,2,3,4) on the inside.
(((1,2,3),4),5) will change to ((((1,2),3),4),5) since the only string of the form (x1, ..., xn, xn+1) is the (1,2,3) on the inside.
(1,(2,3),4) will change to ((1,(2,3)),4)
({1},{2},{3}) will change to (({1},{2}),{3})
As per request, more examples:
((1,2),(3,4),5) will change to (((1,2),(3,4)),5)
((1,2),3,(4,5)) will change to (((1,2),3),(4,5))
(1,(2,(3,4),5)) will change to (1,((2,(3,4)),5)) since the only sequence of the form (x1, ..., xn, xn+1) is the (2,(3,4),5) on the inside.
Here is what I have so far:
re.sub(r'([(][{}a-zA-Z0-9,()]+,[{}a-zA-Z0-9,]+),', r'(\g<1>),', string)
You can see the strings it works for here.
It is not working for (1,(2,3),4), and it is working on things like ({{x},{x,y}},z) when it shouldn't be. Any help is greatly appreciated; I feel like it is possible to get this to work in Regex but there seem to be a lot of special cases that require the Regex to be very precise.
For any recursion in Python, you'll have to use the PyPI regex module (as mentioned by anubhava in his answer). You can use the following pattern:
See regex in use here
(\((({(?:[^{}]+|(?-1))+})|\((?:[^(),]+,[^(),]+)\)|[^(){},]+)(?:,(?2))+),
Replace with (\1),
How it works:
(\((({(?:[^{}]+|(?-1))+})|\((?:[^(),]+,[^(),]+)\)|[^(){},]+)(?:,(?2))+), capture the following into capture group 1, then match ,
\( match ( literally
(({(?:[^{}]+|(?-1))+})|\((?:[^(),]+,[^(),]+)\)|[^(){},]+) match one of the following options one or more times (and capture it into capture group 2)
({(?:[^{}]+|(?-1))+}) capture the following into capture group 3
{(?:[^{}]+|(?-1))+} matches {, then one of the following options one or more times, then }
[^{}]+ matches any character except { or } one or more times
(?-1) recurses the previous capture group (capture group 3)
\((?:[^(),]+,[^(),]+)\) matches (, then the following, then )
[^(),]+ matches any character except (, ), or , one or more times
, matches the comma , character literally
[^(),]+ matches any character except (, ), or , one or more times
[^(){},]+ matches any character except (, ), {, }, or , one or more times
(?:,(?2))+ matches the following one or more times
,(?2) matches ,, then recurses capture group 2
In simpler terms, capture group 2 defines what a term is. It...:
Matches any sets {y1, ..., yn} and, recursively, nested sets like {{y1, ..., yn}, ..., ym}
Matches any complete sequences of exactly two elements: (x1, x2)
Matches any string object (numbers, words, etc.) 1, 2, ... x, y, ...
Then capture group 1 uses the well-defined terms from capture group 2 to match as many terms as possible, with the string containing at least two terms and a comma (x,x, with as many x, as possible). The replacement takes this capture group, encases it in () and appends ,. So in the case of (x,x,x,x), we get ((x,x,x),x).
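As a quick illustration, assuming the third-party regex module (which, as the answer notes, supports this recursion syntax), the pattern and replacement could be applied like this:
import regex  # pip install regex

pattern = r'(\((({(?:[^{}]+|(?-1))+})|\((?:[^(),]+,[^(),]+)\)|[^(){},]+)(?:,(?2))+),'

for s in ("(1,2,3,4,5)", "((1,2,3,4),5)", "(1,(2,3),4)", "({1},{2},{3})"):
    print(s, "->", regex.sub(pattern, r'(\1),', s))
# (1,2,3,4,5)   -> ((1,2,3,4),5)
# ((1,2,3,4),5) -> (((1,2,3),4),5)
# (1,(2,3),4)   -> ((1,(2,3)),4)
# ({1},{2},{3}) -> (({1},{2}),{3})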
Edit
By making the non-capture group possessive (?:[^{}]+|(?-1))++ (prevents backtracking) and changing the order of the options (most prevalent first), we can improve the efficiency of the pattern (764 -> 662 steps):
See regex in use here
(\(([^(){},]+|\((?:[^(),]+,[^(),]+)\)|({(?:[^{}]+|(?-1))++}))(?:,(?2))+),
If you can consider using the PyPI regex module for Python, which supports PCRE features, then it is possible with recursive matching, using this regex:
/
( # start capture group #1
\( # match left (
(?<el> # start named group el
( { (?: [^{}]*+ | (?-1) )* } ) | # Match {...} text OR
( \( (?: [^()]*+ | (?-1) )*+ \) ) | # Match (...) text OR
\w+ # Match 1+ word characters
) # End named group el
(?: , (?&el) )+ # Match comma followed by recursion of 'el'
# Match this group 1+ times
) , # End capture group #1
/ x # Enable extended mode in regex
RegEx Demo
Here's a simple parser built with ply
It is certainly less compact than the regex solutions, but it has a couple of considerable advantages:
It's a lot easier to write and understand. (In fact, my first attempt worked perfectly except for a typo in one of the names.) Moreover, it is reasonably clear by examination exactly what syntax is being parsed. (This assumes some minimal understanding of generative grammars, of course, but the concepts are not particularly difficult and there are many available learning resources.)
If you want to add more features in the future, it's straightforward to modify. And if, instead of just reformatting the text, you want to actually make some use of the decomposed structure, that is easily available without much effort.
As with most generated parsers, it has two components: a lexer (or scanner), which decomposes the input into tokens and discards unimportant text such as whitespace, and a parser which analyses the stream of tokens in order to figure out its structure. Normally a parser would construct some kind of data structure representing the input, normally some kind of tree. Here I've simplified the process by just recombining the parsed input into a transformed output. (In retrospect, I can't help thinking that, as usual, it would have been clearer to produce an entire parse tree and then create the output by doing a walk over the tree. Perhaps as penance I'll redo it later.)
Here's the scanner. The only meaningful tokens are the punctuation symbols and what I've called WORDs, which are sequences of whatever Python considers word characters (usually alphabetic and numeric characters plus underscores), without distinguishing between purely alphabetic, purely numeric, and mixed tokens as in your question.
import ply.lex as lex
tokens = [ "WORD" ]
t_WORD = r"\w+"
# Punctuation
literals = "{}(),"
# Ignore whitespace
t_ignore = " \r\n\t"
# Anything else is an error
def t_error(t):
    print("Illegal character %s" % repr(t.value[0]))
    t.lexer.skip(1)
# Build the lexer
lexer = lex.lex()
Now the parser. The grammar for sequences is a little redundant because it has to special-case a sequence of one item: since the grammar also explicitly inserts parentheses around A,B as it parses, it would be incorrect to add them around the entire sequence. But if the entire sequence is one item, the original parentheses have to be reinserted. For sets, things are much clearer; the elements are not modified at all, and the braces must always be added back.
Here's the entire grammar:
# scalar : WORD | set | sequence
# sequence : '(' scalar ')'
# | '(' seqlist ')'
# seqlist : scalar ',' scalar
# | seqlist ',' scalar
# set : '{' setlist '}'
# setlist : scalar
# | setlist ',' scalar
And here's the implementation, with the grammar repeated, Ply-style, as docstrings:
import ply.yacc as yacc
start = 'scalar'
def p_unit(p):
    """scalar : WORD
              | set
              | sequence
       setlist : scalar
    """
    p[0] = p[1]

def p_sequence_1(p):
    """sequence : '(' scalar ')'
    """
    p[0] = '(%s)' % p[2]

def p_sequence(p):
    """sequence : '(' seqlist ')'
    """
    p[0] = p[2]

def p_seqlist(p):
    """seqlist : scalar ',' scalar
               | seqlist ',' scalar
    """
    p[0] = "(%s,%s)" % (p[1], p[3])

def p_set(p):
    """set : '{' setlist '}'
    """
    p[0] = '{%s}' % p[2]

def p_setlist(p):
    """setlist : setlist ',' scalar
    """
    p[0] = "%s,%s" % (p[1], p[3])

def p_error(p):
    if p:
        print("Syntax error at token", p.type)
    else:
        print("Syntax error at EOF")

parser = yacc.yacc()
Now, a (very) simple driver:
import readline
while True:
    try:
        s = input('> ')
    except EOFError:
        break
    if s:
        print(parser.parse(s, lexer=lexer))

How to use RegEx in an if statement in Python?

I'm doing something like a "Syntax Analyzer" with Kivy, using re (regular expressions).
I only want to check for valid syntax in basic operations (like +|-|*|/|(|)).
The user types the string (with the keyboard) and I validate it with a regex.
But I don't know how to use a regex in an if statement. What I want is: if the string the user gives me isn't correct (or doesn't match the regex), print something like "invalid string", and if it is correct, print "valid string".
I've tried with:
if re.match(patron, string) is not None:
    print("\nTrue")
else:
    print("False")
but it doesn't matter what the string contains, the app always shows True.
Sorry for my poor English. Any help would be greatly appreciated!
import re
patron= re.compile(r"""
(
-?\d+[.\d+]?
[+*-/]
-?\d+[.\d+]?
[+|-|*|/]?
)*
""", re.X)
obj1 = self.ids['text'].text  # TextInput
if re.match(patron, obj1) is not None:
    print("\nValid String")
else:
    print("Invalid string")
if obj1= "53.22+22.11+10*555+62+55.2-66" actually it's correct and app prints "Valid..." but if I put an a like this "a53.22+22.11+10*555+62+55.2-66" it's incorrect and the app must prints invalid.. but instead it still valid.
Your regex always matches because it allows the empty string to match (since the entire regex is enclosed in an optional group).
If you test this live on regex101.com, you can immediately see this and also that it doesn't match the entire string but only parts of it.
I've already corrected two errors in your character classes concerning the use of unnecessary/harmful alternation operators (|) and incorrect placement of the dash, making it into a range operator (-), but it's still incorrect.
I think you want something more like this:
^ # Make sure the match begins at the start of the string
(?: # Start a non-capturing group that matches...
-? # an optional minus sign,
\d+ # one or more digits
(?:\.\d+)? # an optional group that contains a dot and one or more digits.
(?: # Start of a non-capturing group that either matches...
[+*/-] # an operator
| # or
$ # the end of the string.
) # End of inner non-capturing group
)+ # End of outer non-capturing group, required to match at least once.
(?<![+*/-]) # Make sure that the final character isn't an operator.
$ # Make sure that the match ends at the end of the string.
Test it live on regex101.com.
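Put into Python, a minimal sketch of the pattern above (using re.VERBOSE so the commented layout can be kept as-is; the test strings are taken from the question):
import re

patron = re.compile(r"""
    ^               # Make sure the match begins at the start of the string
    (?:             # Start a non-capturing group that matches...
        -?          # an optional minus sign,
        \d+         # one or more digits
        (?:\.\d+)?  # an optional group that contains a dot and one or more digits.
        (?:         # Start of a non-capturing group that either matches...
            [+*/-]  # an operator
            |       # or
            $       # the end of the string.
        )           # End of inner non-capturing group
    )+              # End of outer non-capturing group, required to match at least once.
    (?<![+*/-])     # Make sure that the final character isn't an operator.
    $               # Make sure that the match ends at the end of the string.
""", re.VERBOSE)

for s in ("53.22+22.11+10*555+62+55.2-66", "a53.22+22.11+10*555+62+55.2-66"):
    print("Valid" if patron.match(s) else "Invalid", s)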
This answers your question about how to use if with regex:
(Caveat: the regex formula will not weed out all invalid inputs, e.g., two decimal points (".."), two operators ("++"), and such, so please adjust it to suit your exact needs.)
import re
regex = re.compile(r"[\d.+\-*\/]+")
input_list = [
    "53.22+22.11+10*555+62+55.2-66", "a53.22+22.11+10*555+62+55.2-66",
    "53.22+22.pq11+10*555+62+55.2-66", "53.22+22.11+10*555+62+55.2-66zz",
]
for input_str in input_list:
    mmm = regex.match(input_str)
    if mmm and input_str == mmm.group():
        print('Valid: ', input_str)
    else:
        print('Invalid: ', input_str)
Above as a function for use with a single string instead of a list:
import re
regex = re.compile(r"[\d.+\-*\/]+")

def check_for_valid_string(in_string=""):
    mmm = regex.match(in_string)
    if mmm and in_string == mmm.group():
        return 'Valid: ', in_string
    return 'Invalid: ', in_string

print(*check_for_valid_string('53.22+22.11+10*555+62+55.2-66'))
print(*check_for_valid_string('a53.22+22.11+10*555+62+55.2-66'))
print(*check_for_valid_string('53.22+22.pq11+10*555+62+55.2-66'))
print(*check_for_valid_string('53.22+22.11+10*555+62+55.2-66zz'))
Output:
## Valid: 53.22+22.11+10*555+62+55.2-66
## Invalid: a53.22+22.11+10*555+62+55.2-66
## Invalid: 53.22+22.pq11+10*555+62+55.2-66
## Invalid: 53.22+22.11+10*555+62+55.2-66zz

regex continue only if positive lookahead has been matched at least once

Using Python: how do I get the regex to continue only if a positive lookahead has been matched at least once?
I'm trying to match:
Clinton-Orfalea-Brittingham Fellowship Program
Here's the code I'm using now:
dp2 = r'[A-Z][a-z]+(?:-\w+|\s[A-Z][a-z]+)+'
print(np.unique(re.findall(dp2, tt)))
I'm matching the word, but it's also matching a bunch of other extraneous words.
My thought was that I'd like the \s[A-Z][a-z] to kick in ONLY IF -\w+ has been hit at least once (or maybe twice). would appreciate any thoughts.
To clarify: I'm not aiming to match specifically this set of words, but to be able to generically match Proper noun- Proper noun- (indefinite number of times) and then a non-hyphenated Proper noun.
eg.
Noun-Noun-Noun Noun Noun
Noun-Noun Noun
Noun-Noun-Noun Noun
THE LATEST ITERATION:
dp5= r'(?:[A-Z][a-z]+-?){2,3}(?:\s\w+){2,4}'
The {m,n} notation can be used to force the regex to ONLY MATCH if the previous expression exists between m and n times. Maybe something like
(?:[A-Z][a-z]+-?){2,3}\s\w+\s\w+ # matches 'Clinton-Orfalea-Brittingham Fellowship Program'
If you're SPECIFICALLY looking for "Clinton-Orfalea-Brittingham Fellowship Program", why are you using Regex to find it? Just use word in string. If you're looking for things of the form: Name-Name-Name Noun Noun, this should work, but be aware that Name-Name-Name-Name Noun Noun won't, nor will Name-Name-Name Noun Noun Noun (In fact, something like "Alice-Bob-Catherine Program" will match not only that but whatever word comes after it!)
# Explanation
RE = r"""(?: # Begins the group so we can repeat it
[A-Z][a-z]+ # Matches one cap letter then any number of lowercase
-? # Allows a hyphen at the end of the word w/o requiring it
){2,3} # Ends the group and requires the group match 2 or 3 times in a row
\s\w+ # Matches a space and the next word
\s\w+ # Does so again
# those last two lines could just as easily be (?:\s\w+){2}
"""
RE = re.compile(RE, re.VERBOSE)  # will compile the expression as written
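For example, a small usage sketch (the sample sentence is made up for illustration):
tt = "The Clinton-Orfalea-Brittingham Fellowship Program was announced."
print(RE.findall(tt))   # ['Clinton-Orfalea-Brittingham Fellowship Program']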
If you're looking specifically for hyphenated proper nouns followed by non-hyphenated proper nouns, I would do this:
[A-Z][a-z]+-(?:[A-Z][a-z]+(?:-|\s))+
# Explanation
RE = r"""[A-Z][a-z]+- # Cap letter+small letters ending with a hyphen
(?: # start a non-cap group so we can repeat it
[A-Z][a-z]+# As before, but doesn't require a hyphen
(?:
-|\s # but if it doesn't have a hyphen, it MUST have a space
) # (this group is just to give precedence to the |
)+ # can match multiple of these.
"""
