Regex to match an identifier and reject those containing invalid characters - python

I am defining a regex to match my identifiers: an identifier has to start with a letter, followed by any number of letters, digits, and underscores.
My current regex is r'[A-Za-z][A-Za-z0-9_]*', and it works well except in cases like this: if I pass in testid#entifier_, it returns a match for testid and another for entifier_. I want it to reject the identifier completely, not match parts of it.
It just ends up splitting them.
What can I do without using a complex look-ahead for legal characters?
The input is simply:
arg = sys.argv[1]
file = open(arg)
inLines = file.read()
file.close()
tokens = lexer(inLines, tokenFormats)
A sample of my defined regexes:
tokenFormats = [
(r'[\s\n\t]+', None), #Whitespace
(r'\/\*(\*(?!\/)|[^*])*\*\/', None), #Comment
(r'\(', LParent),
(r'\)', RParent),
(r'\[', LBracket),
(r'\]', RBracket),
(r'\{', LBrace),
(r'\}', RBrace),
(r'\,', CommaT),
(r'(?<="{1}).*?(?=")', STRLITERAL),
(r'\"', QuoteT),
(r'\.', PeriodT),
(r'\-?[0-9]*\.[0-9]+', ValueR),
(r'\+', AddT),
(r'-', AddT),
(r'\|\|', AddT),
(r';', Semicolon),
My matching loop is like this:
def lexer(input, tokenFormats):
    pos = 0
    tokens = []
    while pos < len(input):
        match = None
        for tokenFormat in tokenFormats:
            pattern, tag = tokenFormat
            regex = re.compile(pattern)
            match = regex.match(input, pos) #Essentially Build Lexeme
            if match:
                lexeme = match.group(0)
                if tag:
                    if tag == Identifier and len(str(lexeme)) > 27: #rough fix to check length. Very hacky
                        sys.stderr.write('Illegal length for identifier: %s\n' % lexeme)
                        break
                    attr = checkForAttribute(lexeme, tag)
                    token = (lexeme, tag, attr)
                    tokens.append(token)
                    break
                else:
                    break
        if not match:
            sys.stderr.write('Illegal or unknown character: %s\n' % input[pos])
            pos = pos + 1
        else:
            pos = match.end(0)
    return tokens

Try anchoring your expression:
r'^[A-Za-z][A-Za-z0-9_]*$'
This requires that the entire identifier matches the expression, not just part of it because you are anchoring the expression to the beginning and end of the string. This prevents part of the string from matching.
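A minimal sketch of the anchored check, assuming each candidate identifier has already been isolated as its own string. The helper name is_identifier is hypothetical and not part of the question's lexer. re.fullmatch has much the same effect as wrapping the pattern in ^ and $:

```python
import re

# Same identifier pattern as in the question, compiled once.
IDENTIFIER = re.compile(r'[A-Za-z][A-Za-z0-9_]*')

def is_identifier(text):
    """Hypothetical helper: True only if the whole string is an identifier."""
    # fullmatch succeeds only when the pattern consumes the entire string.
    return IDENTIFIER.fullmatch(text) is not None

print(is_identifier('testidentifier_'))   # True
print(is_identifier('testid#entifier_'))  # False
print(is_identifier('9lives'))            # False
```

Note that this only helps when the string being tested is a single token; it does not by itself decide where a token ends inside a larger input.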

If the # symbol is your only concern, try this r'[a-zA-Z]#?[a-zA-Z0-9_]+'.

If you want to allow the # as well you could use the following regex:
r'[A-Za-z][A-Za-z0-9_]*#?[A-Za-z0-9_]*'
tested: https://regex101.com/r/vlt8qo/3/
However, going by the description of your problem:
I am defining a regex to match my defined identifiers - an identifier has to start with a letter followed by any number of letters, numbers, and underscores.
there seems to be an inconsistency, since # is not defined as part of your identifiers.
Following the edit to your post,
I have adapted my regex to:
r'(?<=[\(\)\]\[\-=\+\s\n\t,;\|\.\"])[A-Za-z][A-Za-z0-9_]*(?=[\(\)\]\[\-=\+\s\n\t,;\|\.\"])|^[A-Za-z][A-Za-z0-9_]*(?=[\(\)\]\[\-=\+\s\n\t,;\|\.\"])'
and tested it on several patterns:
https://regex101.com/r/vlt8qo/5/

Related

How can I replace part of a string with a pattern

For example, if the string is "abbacdeffel" and the pattern "xyyx" is to be replaced with "1234",
the result would go from "abbacdeffel" to "1234cd1234l".
I have tried to think this through but couldn't come up with anything. At first I thought a dictionary might help, but nothing came to mind.
What you're looking to do can be accomplished with regular expressions (regex for short). Regular expressions let you extract exactly the part of a string that you want.
In your case, you want to match substrings shaped like abba, so use the following regex:
(\w+)(\w+)\2\1
https://regex101.com/r/hP8lA3/1
You can match two word groups and use backreferences to require that the second group is repeated first, followed by the first group.
So implementing this in python code looks like this:
First, import the regex module in python
import re
Then, declare your variable
text = "abbacdeffel"
re.finditer returns an iterator, so you can loop over all the matches
matches = re.finditer(r"(\w)(\w)\2\1", text)
Go through all the matches that the regex found and replace each matched text with "1234"
for match in matches:
    text = text.replace(match.group(0), "1234")
For debugging:
print(text)
Complete Code:
import re
text = "abbacdeffel"
matches = re.finditer(r"(\w)(\w)\2\1", text)
for match in matches:
    text = text.replace(match.group(0), "1234")
print(text)
You can learn more about Regular Expressions here: https://regexone.com/references/python
New version of code (there was a bug):
def replace_with_pattern(pattern, line, replace):
    from collections import OrderedDict
    set_of_chars_in_pattern = set(pattern)
    indice_start_pattern = 0
    output_line = ""
    while indice_start_pattern < len(line):
        potential_end_pattern = indice_start_pattern + len(pattern)
        subline = line[indice_start_pattern:potential_end_pattern]
        print(subline)
        set_of_chars_in_subline = set(subline)
        if len(set_of_chars_in_subline) != len(set_of_chars_in_pattern):
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
            continue
        map_of_chars = OrderedDict()
        liste_of_chars_in_pattern = []
        for char in pattern:
            if char not in liste_of_chars_in_pattern:
                liste_of_chars_in_pattern.append(char)
        print(liste_of_chars_in_pattern)
        for subline_char in subline:
            if subline_char not in map_of_chars.values():
                map_of_chars[liste_of_chars_in_pattern.pop(0)] = subline_char
        print(map_of_chars)
        wanted_subline = ""
        for char_of_pattern in pattern:
            wanted_subline += map_of_chars[char_of_pattern]
        print("wanted_subline =" + wanted_subline)
        if subline == wanted_subline:
            output_line += replace
            indice_start_pattern += len(pattern)
        else:
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
    return output_line
Some tests:
test1 = replace_with_pattern("xyyx", "abbacdeffel", "1234")
test2 = replace_with_pattern("abbacdeffel", "abbacdeffel", "1234")
print(test1, test2)
=> 1234cd1234l 1234
Here goes my attempt:
([a-zA-Z])(?!\1)([a-zA-Z])\2\1
Assuming you want to match letters only (for other ranges, change both [a-zA-Z] as appropriate), we have:
([a-zA-Z])
Find the first character, and note it so we can later refer to it with \1.
(?!\1)
Check to see if the next character is not the same as the first, but without advancing the search pointer. This is to prevent aaaa being accepted. If aaaa is OK, just remove this subexpression.
([a-zA-Z])
Find the second character, and note it so we can later refer to it with \2.
\2\1
Now find the second again, then the first again, so we match the full abba pattern.
And finally, to do a replace operation, the full command would be:
import re
re.sub(r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1',
'1234',
'abbacdeffelzzzz')
The r at the start of the regex pattern is to prevent Python processing the backslashes. Without it, you would need to do:
import re
re.sub('([a-zA-Z])(?!\\1)([a-zA-Z])\\2\\1',
'1234',
'abbacdeffelzzzz')
Now, I see the spec has expanded to a user-defined pattern; here is some code that will build that pattern:
import re
def make_re(pattern, charset):
    result = ''
    seen = []
    for c in pattern:
        # Is this a letter we've seen before?
        if c in seen:
            # Yes, so we want to match the captured pattern
            result += '\\' + str(seen.index(c)+1)
        else:
            # No, so match a new character from the charset,
            # but first exclude already matched characters
            for i in range(len(seen)):
                result += '(?!\\' + str(i + 1) + ')'
            result += '(' + charset + ')'
            # Note we have seen this letter
            seen.append(c)
    return result
print(re.sub(make_re('xzzx', '\\d'), 'abba', 'abba1221b99999889'))
print(re.sub(make_re('xyzxyz', '[a-z]'), '123123', 'abcabc zyxzyyx zyzzyz'))
Outputs:
abbaabbab9999abba
123123 zyxzyyx zyzzyz

How to make a regular expression 'greedy but optional'

I'm trying to write a parser for a string which represents a file path, optionally followed by a colon (:) and a string representing access flags (e.g. r+ or w). The file name can itself contain colons, e.g., foo:bar.txt, so the colon separating the access flags should be the last colon in the string.
Here is my implementation so far:
import re
def parse(string):
    SCHEME = r"file://" # File prefix
    PATH_PATTERN = r"(?P<path>.+)" # One or more of any character
    FLAGS_PATTERN = r"(?P<flags>.+)" # The letters r, w, a, b, a '+' symbol, or any digit
    # FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + r":" + FLAGS_PATTERN + r"$" # This makes the first test pass, but the second one fail
    FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + optional(r":" + FLAGS_PATTERN) + r"$" # This makes the second test pass, but the first one fail
    tokens = re.match(FILE_RESOURCE_PATTERN, string).groupdict()
    return tokens['path'], tokens['flags']
def optional(re):
    '''Encloses the given regular expression in a group which matches 0 or 1 repetitions.'''
    return '({})?'.format(re)
I've tried the following tests:
import pytest
def test_parse_file_with_colon_in_file_name():
    assert parse("file://foo:bar.txt:r+") == ("foo:bar.txt", "r+")
def test_parse_file_without_acesss_flags():
    assert parse("file://foobar.txt") == ("foobar.txt", None)
if __name__ == "__main__":
    pytest.main([__file__])
The problem is that by either using or not using optional, I can make one or the other test pass, but not both. If I make r":" + FLAGS_PATTERN optional, then the preceding regular expression consumes the entire string.
How can I adapt the parse method to make both tests pass?
You should build the regex like
^file://(?P<path>.+?)(:(?P<flags>[^:]+))?$
See the regex demo.
In your code, the ^ anchor is not necessary, since re.match already anchors the match at the start of the string. The path group matches one or more characters lazily (so any text that the optional second group can match ends up in that group), up to the first : that is followed only by one or more characters other than : and the end of the string. Thanks to the $ anchor, the first group matches the whole remainder of the string when the optional second group does not match.
Use the following fix:
PATH_PATTERN = r"(?P<path>.+?)" # One or more of any character
FLAGS_PATTERN = r"(?P<flags>[^:]+)" # The letters r, w, a, b, a '+' symbol, or any digit
See the online Python demo.
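Putting the fix together, a minimal sketch of the whole parse function might look like this. The non-capturing (?:...) wrapper and the ValueError raised on a non-matching string are assumptions added for illustration; they are not part of the answer above.

```python
import re

# Lazy path, then an optional ":flags" suffix that may not contain a colon.
FILE_RESOURCE = re.compile(r"file://(?P<path>.+?)(?::(?P<flags>[^:]+))?$")

def parse(string):
    match = FILE_RESOURCE.match(string)
    if match is None:
        # Assumption: reject input that does not look like a file resource.
        raise ValueError("not a file resource: %r" % string)
    # flags is None when the optional group did not take part in the match.
    return match.group("path"), match.group("flags")

print(parse("file://foo:bar.txt:r+"))  # ('foo:bar.txt', 'r+')
print(parse("file://foobar.txt"))      # ('foobar.txt', None)
```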
Just for fun, I wrote this parse function, which I think is better than using RE?
def parse(string):
    s = string.split('//')[-1]
    try:
        path, flags = s.rsplit(':', 1)
    except ValueError:
        path, flags = s.rsplit(':', 1)[0], None
    return path, flags

String substitutions based on the matching object (Python)

I struggle to understand the group method in Python's regular expressions library. In this context, I try to do substitutions on a string depending on the matching object.
That is, I want to replace the matched objects (+ and \n in this example) with a particular string in the my_dict dictionary (with rep1 and rep2 respectively).
As seen from this question and answer,
I have tried this:
content = '''
Blah - blah \n blah * blah + blah.
'''
regex = r'[+\-*/]'
for mobj in re.finditer(regex, content):
    t = mobj.lastgroup
    v = mobj.group(t)
    new_content = re.sub(regex, repl_func(mobj), content)
def repl_func(mobj):
    my_dict = { '+': 'rep1', '\n': 'rep2'}
    try:
        match = mobj.group(0)
    except AttributeError:
        match = ''
    else:
        return my_dict.get(match, '')
print(new_content)
But I get None for t followed by an IndexError when computing v.
Any explanations and example code would be appreciated.
Despite Wiktor's truly Pythonic answer, there's still the question of why the OP's original algorithm wouldn't work.
Basically there are 2 issues:
1. The call new_content = re.sub(regex, repl_func(mobj), content) passes the result of repl_func for a single match object as a fixed replacement string, so every match of regex is replaced with that one value.
The correct call is new_content = re.sub(regex, repl_func, content).
As documented for re.sub, repl_func is then invoked with the current match object for each match!
2. repl_func(mobj) does some unnecessary exception handling, which can be simplified:
my_dict = {'\n': '', '+':'rep1', '*':'rep2', '/':'rep3', '-':'rep4'}
def repl_func(mobj):
    global my_dict
    return my_dict.get(mobj.group(0), '')
This is equivalent to Wiktor's solution - he just got rid of the function definition itself by using a lambda expression.
With this modification, the for mobj in re.finditer(regex, content): loop has become superfluous, as it does the same calculation multiple times.
Just for the sake of completeness here is a working solution using re.finditer(). It builds the result string from the matched slices of content:
import re
my_regx = r'[\n+*/-]'
my_dict = {'\n': '', '+':'rep1' , '*':'rep2', '/':'rep3', '-':'rep4'}
content = "A*B+C-D/E"
res = ""
cbeg = 0
for mobj in re.finditer(my_regx, content):
    # get matched string and its slice indexes
    mstr = mobj.group(0)
    mbeg = mobj.start()
    mend = mobj.end()
    # replace matched string
    mrep = my_dict.get(mstr, '')
    # append non-matched part of content plus replacement
    res += content[cbeg:mbeg] + mrep
    # set new start index of remaining slice
    cbeg = mend
# finally add remaining non-matched slice
res += content[cbeg:]
print(res)
The r'[+\-*/]' regex does not match a newline, so your '\n': 'rep2' entry would never be used. To handle newlines as well, add \n to the character class: r'[\n+*/-]'.
Next, you get None because your regex does not contain any named capturing groups, see re docs:
match.lastgroup
The name of the last matched capturing group, or None if the group didn’t have a name, or if no group was matched at all.
To replace using the match, you do not even need to use re.finditer, use re.sub with a lambda as the replacement:
import re
content = '''
Blah - blah \n blah * blah + blah.
'''
regex = r'[\n+*/-]'
my_dict = { '+': 'rep1', '\n': 'rep2'}
new_content = re.sub(regex, lambda m: my_dict.get(m.group(),""), content)
print(new_content)
# => rep2Blah  blah rep2 blah  blah rep1 blah.rep2
See the Python demo
m.group() returns the whole match (it is equivalent to m.group(0)). If you had a pair of unescaped parentheses in the pattern, it would create a capturing group, and you could access the first one with m.group(1), etc.
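To illustrate when lastgroup is not None, here is a small sketch that gives each alternative a named group and picks the replacement by group name. The group names plus and newline, the replacements dictionary, and the sample input are made up for this example.

```python
import re

# Each alternative has a name, so match.lastgroup reports which one matched.
pattern = re.compile(r'(?P<plus>\+)|(?P<newline>\n)')
replacements = {'plus': 'rep1', 'newline': 'rep2'}

def repl_func(mobj):
    # lastgroup is the name of the last matched capturing group.
    return replacements[mobj.lastgroup]

result = pattern.sub(repl_func, 'a+b\nc')
print(result)  # arep1brep2c
```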

Python test if string matches a template value

I am trying to iterate through a list of strings, keeping only those that match a naming template I have specified. I want to accept any list entry that matches the template exactly, other than having an integer in a variable <SCENARIO> field.
The check needs to be general. Specifically, the string structure could change such that there is no guarantee <SCENARIO> always shows up at character X (to use list comprehensions, for example).
The code below shows an approach that works using split, but there must be a better way to make this string comparison. Could I use regular expressions here?
template = 'name_is_here_<SCENARIO>_20131204.txt'
testList = ['name_is_here_100_20131204.txt', # should accept
            'name_is_here_100_20131204.txt.NEW', # should reject
            'other_name.txt'] # should reject
acceptList = []
for name in testList:
    print(name)
    acceptFlag = True
    splitTemplate = template.split('_')
    splitName = name.split('_')
    # if lengths do not match, name cannot possibly match template
    if len(splitTemplate) == len(splitName):
        print(list(zip(splitTemplate, splitName)))
        # compare records in the split
        for t, n in zip(splitTemplate, splitName):
            if t != n and not t == '<SCENARIO>':
                #reject if any of the "other" fields are not identical
                #(would also check that '<SCENARIO>' field is numeric - not shown here)
                print('reject: ' + name)
                acceptFlag = False
    else:
        acceptFlag = False
    # keep name if it passed checks
    if acceptFlag == True:
        acceptList.append(name)
print(acceptList)
# correctly prints --> ['name_is_here_100_20131204.txt']
Try with the re module for regular expressions in Python:
import re
template = re.compile(r'^name_is_here_(\d+)_20131204\.txt$')
testList = ['name_is_here_100_20131204.txt', #accepted
'name_is_here_100_20131204.txt.NEW', #rejected!
'name_is_here_aabs2352_20131204.txt', #rejected!
'other_name.txt'] #rejected!
acceptList = [item for item in testList if template.match(item)]
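Since the question asks for a general check, one possible sketch is to build the regex from the template string instead of hard-coding it. The helper name template_to_regex and its placeholder parameter are hypothetical. The idea is to escape the literal parts with re.escape and put a digits group where <SCENARIO> appears:

```python
import re

def template_to_regex(template, placeholder='<SCENARIO>'):
    """Hypothetical helper: turn a naming template into a compiled regex."""
    # Escape the literal pieces so characters like '.' match only themselves.
    literal_parts = [re.escape(part) for part in template.split(placeholder)]
    # Every placeholder becomes a group of one or more digits.
    return re.compile(r'(\d+)'.join(literal_parts))

pattern = template_to_regex('name_is_here_<SCENARIO>_20131204.txt')
testList = ['name_is_here_100_20131204.txt',
            'name_is_here_100_20131204.txt.NEW',
            'name_is_here_aabs2352_20131204.txt',
            'other_name.txt']
# fullmatch requires the whole name to fit the template.
acceptList = [name for name in testList if pattern.fullmatch(name)]
print(acceptList)  # ['name_is_here_100_20131204.txt']
```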
This should do it. I understand that name_is_here is just a placeholder for alphanumeric characters?
import re
testList = ['name_is_here_100_20131204.txt', # should accept
            'name_is_here_100_20131204.txt.NEW', # should reject
            'other_name.txt',
            'name_is_44ere_100_20131204.txt',
            'name_is_here_100_2013120499.txt',
            'name_is_here_100_something_2013120499.txt',
            'name_is_here_100_something_20131204.txt']
def find(scenario):
    begin = '[a-z_]+100_' # any combination of chars and underscores followed by 100
    end = r'_[0-9]{8}\.txt$' # exactly eight digits followed by .txt at the end
    pattern = re.compile("".join([begin, scenario, end]))
    result = []
    for word in testList:
        if pattern.match(word):
            result.append(word)
    return result
find('something') # returns ['name_is_here_100_something_20131204.txt']
EDIT: scenario is now in a separate variable; the regex only matches characters followed by 100, then the scenario, then eight digits followed by .txt.

Python Regex not matching at start of string?

I'm going through a binary file with regexes extracting data, and I'm having a problem with regex I can't track down.
This is the code I'm having issues with:
z = 0
for char in string:
    self.response.out.write('|%s' % char.encode('hex'))
    z += 1
    if z > 20:
        self.response.out.write('<br>')
        break
title = []
string = re.sub('^\x72.([^\x7A]+)', lambda match: append_match(match, title), string, 1)
print_info('Title', title)
def append_match(match, collection, replace = ''):
    collection.append(match.group(1))
    return replace
This is the content of the first 20 chars in string when this runs:
|72|0a|50|79|72|65|20|54|72|6f|6c|6c|7a|19|54|72|6f|6c|6c|62|6c
It returns nothing, except if I remove the ^, in which case it returns "Troll" (not the quotes) which is 54726F6C6C. It should be returning everything up to the \x7a as I read it.
What's going on here?
The problem is that \x0A (=newline) won't be matched by the dot by default. Try adding the dotall flag to your pattern, for example:
re.sub('(?s)^\x72.([^\x7A]+)....
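A small sketch of the difference, assuming Python 3 and a bytes object rebuilt from the hex dump in the question. The variable names are made up for this example:

```python
import re

# The 21 bytes shown in the question's hex dump.
data = bytes.fromhex('720a507972652054726f6c6c7a1954726f6c6c626c')

# Without DOTALL the dot refuses to match the newline byte 0x0a.
without_flag = re.search(rb'^\x72.([^\x7A]+)', data)
# With the inline (?s) flag the dot matches any byte, including newline.
with_flag = re.search(rb'(?s)^\x72.([^\x7A]+)', data)

print(without_flag)        # None
print(with_flag.group(1))  # b'Pyre Troll'
```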
