How to make a regular expression 'greedy but optional'

I'm trying to write a parser for a string which represents a file path, optionally followed by a colon (:) and a string representing access flags (e.g. r+ or w). The file name can itself contain colons, e.g., foo:bar.txt, so the colon separating the access flags should be the last colon in the string.
Here is my implementation so far:
import re
def parse(string):
    SCHEME = r"file://" # File prefix
    PATH_PATTERN = r"(?P<path>.+)" # One or more of any character
    FLAGS_PATTERN = r"(?P<flags>.+)" # The letters r, w, a, b, a '+' symbol, or any digit
    # FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + r":" + FLAGS_PATTERN + r"$" # This makes the first test pass, but the second one fail
    FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + optional(r":" + FLAGS_PATTERN) + r"$" # This makes the second test pass, but the first one fail
    tokens = re.match(FILE_RESOURCE_PATTERN, string).groupdict()
    return tokens['path'], tokens['flags']
def optional(re):
    '''Encloses the given regular expression in a group which matches 0 or 1 repetitions.'''
    return '({})?'.format(re)
I've tried the following tests:
import pytest
def test_parse_file_with_colon_in_file_name():
    assert parse("file://foo:bar.txt:r+") == ("foo:bar.txt", "r+")
def test_parse_file_without_acesss_flags():
    assert parse("file://foobar.txt") == ("foobar.txt", None)
if __name__ == "__main__":
    pytest.main([__file__])
The problem is that by either using or not using optional, I can make one or the other test pass, but not both. If I make r":" + FLAGS_PATTERN optional, then the preceding regular expression consumes the entire string.
How can I adapt the parse method to make both tests pass?

You should build the regex like
^file://(?P<path>.+?)(:(?P<flags>[^:]+))?$
In your code, the ^ anchor is not necessary because re.match already anchors the match at the start of the string. The path group matches one or more characters lazily, so it gives up as much text as it can to the optional group: it stops at the first : that is followed by one or more characters other than : and then the end of the string. In practice that is the last colon, so the flags land in the second capture. Thanks to the $ anchor, the path group expands to cover the rest of the string whenever the optional group does not match.
Use the following fix:
PATH_PATTERN = r"(?P<path>.+?)" # One or more of any character, as few as possible (lazy)
FLAGS_PATTERN = r"(?P<flags>[^:]+)" # One or more characters other than ':'
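Putting the two changed patterns back into the question's parse function gives roughly the following. This is a minimal sketch rather than the answerer's exact code: the non-capturing group (?:...) around the optional part and the ValueError raised for non-matching input are assumptions added here.

```python
import re

# Lazy path, then an optional ":flags" suffix that must end the string.
FILE_RESOURCE_PATTERN = re.compile(
    r"file://(?P<path>.+?)(?::(?P<flags>[^:]+))?$"
)

def parse(string):
    match = FILE_RESOURCE_PATTERN.match(string)
    if match is None:
        # Assumption: fail loudly on input that is not a file:// resource.
        raise ValueError("not a file resource: {!r}".format(string))
    # An optional named group that did not participate yields None.
    return match.group("path"), match.group("flags")

print(parse("file://foo:bar.txt:r+"))  # ('foo:bar.txt', 'r+')
print(parse("file://foobar.txt"))      # ('foobar.txt', None)
```

Both tests from the question should pass with this version. Note that a path containing a colon but no flags, such as file://foo:bar.txt, is still ambiguous and is read as path foo with flags bar.txt.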

Just for fun, I wrote this parse function, which I think is better than using a regex:
def parse(string):
    s = string.split('//')[-1]
    try:
        path, flags = s.rsplit(':', 1)
    except ValueError:
        # No colon at all: the whole string is the path
        path, flags = s, None
    return path, flags
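The same idea can also be written without the try/except by using str.rpartition, which splits on the last occurrence of a separator. This is a sketch under assumptions: the scheme parameter and its default value are introduced here for illustration and are not part of the original answer.

```python
def parse(string, scheme="file://"):
    # Strip the scheme only when it really is the prefix.
    rest = string[len(scheme):] if string.startswith(scheme) else string
    # rpartition splits on the LAST colon; sep is "" when there is none.
    path, sep, flags = rest.rpartition(":")
    if not sep:
        return rest, None
    return path, flags

print(parse("file://foo:bar.txt:r+"))  # ('foo:bar.txt', 'r+')
print(parse("file://foobar.txt"))      # ('foobar.txt', None)
```

Like the regex version, this cannot tell a colon inside a file name from the flag separator when no flags are given.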

Related

regex: pattern fails to match what I am looking for

I have the following code that tries to retrieve the name of a file from a directory based on a double \ character:
import re
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=*\\\\)*'
re.findall(pattern,string)
The reasoning is that the name of the file always comes after a double \, so I try to find any string that is preceded by text ending with \.
Nevertheless, when I run this code I get the following error:
error: nothing to repeat at position 4
What am I doing wrong?
Edit: The concrete output I am looking for is getting the string 'abo_st_gas_dtd_csv' as a match.

There's a couple of things going on:
You need to declare your string definition using the same r'string' notation as for the pattern; right now your string only has a single backslash, since the first one of the two is escaped.
I'm not sure you're using * correctly. It means "repeat the immediately preceding element zero or more times", not "any string" (as it does, e.g., in the usual shell patterns). The first * in the parentheses has nothing preceding it, so the regex is invalid; hence the error you see. I think what you want is .*, i.e., any character repeated 0 or more times. Furthermore, it is not needed inside the parentheses. A more correct regexp would be r'(?<=\\\\).*':
import re
string = r'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=\\\\).*'
re.findall(pattern,string)
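The difference between the two string literals can be checked directly. The sketch below uses a shortened path (an assumption made for brevity) and shows that the lookbehind has to agree with the number of backslashes actually stored in the string.

```python
import re

plain = 'FO_PRE\\abo_st_gas_dtd.csv'   # normal literal: the escape collapses to ONE backslash
raw = r'FO_PRE\\abo_st_gas_dtd.csv'    # raw literal: both backslashes are kept

print(plain.count('\\'), raw.count('\\'))  # 1 2

two_backslashes = r'(?<=\\\\).*'  # lookbehind for two literal backslashes
one_backslash = r'(?<=\\).*'      # lookbehind for one literal backslash

print(re.findall(two_backslashes, raw))    # ['abo_st_gas_dtd.csv']
print(re.findall(two_backslashes, plain))  # []
print(re.findall(one_backslash, plain))    # ['abo_st_gas_dtd.csv']
```

So either keep the non-raw string and look behind for a single backslash, or make the string raw and look behind for two.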

Your pattern is just a lookbehind, which is zero-width and so, by itself, cannot consume any text. I would use this re.findall approach:
import re
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
filename = re.findall(r'\\([^.]+\.\w+)$', string)[0]
print(filename) # abo_st_gas_dtd.csv

files = 'I:E\\trm.csvest/PZMALIo4\\ETRM841_FX_.csvDeals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
counter = -1
my_files = []
for f in files:
    counter += 1
    if ord(f) == 92:  # backslash
        temp = files[counter+1:len(files)]
        temp_file = ""
        for f1 in temp:
            temp_file += f1
            # If the three characters after the '.' are "csv", a file name ends here
            if f1 == '.' and temp[len(temp_file):len(temp_file)+3] == "csv":
                my_files.append(temp_file + "csv")
                break
print(my_files)#['trm.csv', 'ETRM841_FX_.csv', 'abo_st_gas_dtd.csv']

Python Regex Check for Credit Cards

I created a script that checks the card numbers in the list credit_cards. It is a simple function that marks each one as valid or invalid depending on the regex patterns listed.
My problem is how to extend this regex check to allow spaces and periods, for a card such as 3423.3423.2343.3433 or 3323 3223 442 3234. I already allow hyphens as a delimiter, but how can I also allow multiple delimiters, like periods and spaces?
Here is my script-
import re
credit_cards = ['6011346728478930','5465625879615786','5711424424442444',
                '5000-2368-7954-3214', '4444444444444444', '5331625879615786', '5770625879615786',
                '5750625879615786', '575455879615786']
def World_BINS(credit_cards):
    valid_BINS = r"^5(?:465|331|000|[0-9]{2})(-?\d{4}){3}$"
    do_not_repeat = r"((\d)-?(?!(-?\2){3})){16}"
    filters = valid_BINS, do_not_repeat
    for num in credit_cards:
        if all(re.match(f, num) for f in filters):
            print(f"{num} is Valid")
        else:
            print (f"{num} is invalid")
World_BINS(credit_cards)

You can use
import re
credit_cards = ['5000 2368 7954 3214','5000.2368.7954.3214','6011346728478930','5465625879615786',
                '5711424424442444', '5000-2368-7954-3214', '4444444444444444', '5331625879615786',
                '5770625879615786','5750625879615786', '575455879615786']
def World_BINS(credit_cards):
    valid_BINS = r"^5(?:465|331|000|[0-9]{2})(?=([\s.-]?))(\1\d{4}){3}$"
    do_not_repeat = r"^((\d)([\s.-]?)(?!(\3?\2){3})){16}$"
    filters = [valid_BINS, do_not_repeat]
    for num in credit_cards:
        if all(re.match(f, num) for f in filters):
            print(f"{num} is Valid")
        else:
            print (f"{num} is invalid")
World_BINS(credit_cards)
The (?=([\s.-]?))(\1\d{4}){3} part of the first regex captures an optional whitespace character (\s), . or - into Group 1 inside a lookahead, and \1 then requires that same value (an empty string, whitespace, . or -) before each following block of four digits. The lookahead is used to make sure the delimiters / separators are used consistently in the string.
In the second regex, ^((\d)([\s.-]?)(?!(\3?\2){3})){16}$, a similar technique is used: the whitespace, . or - is captured into Group 3, and \3? optionally matches that same delimiter inside the quantified group of the negative lookahead.
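The consistent-delimiter trick can be tried in isolation. The sketch below uses a simplified pattern (four blocks of four digits, no BIN check), and the function name has_consistent_separators is hypothetical; it is meant only to show how the captured delimiter is reused through \1.

```python
import re

# Simplified for illustration: the first block is any four digits.
# The lookahead captures the (possibly empty) delimiter that follows it,
# and \1 then demands exactly that delimiter before every later block.
CONSISTENT = re.compile(r"^\d{4}(?=([\s.-]?))(?:\1\d{4}){3}$")

def has_consistent_separators(number):
    return CONSISTENT.match(number) is not None

for candidate in ["5000 2368 7954 3214", "5000.2368.7954.3214",
                  "5000236879543214", "5000-2368.7954 3214"]:
    print(candidate, has_consistent_separators(candidate))
```

The first three candidates use one delimiter throughout (or none) and are accepted; the last one mixes delimiters and is rejected.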

Change Python script parameters from another Python script

I have a main-script from which I want to make a couple of temporary copies, with slight changes in each of the copies.
main.py could look like this:
import numpy as np
import module
import bar
...
foo = bar(label='C2H4', point=(1,0))
atoms = 'H4'
template = read('template.t')
size = template.lengths()
n = 4
alpha = 0.5
batch_size = 256 // (n * alpha)
dct = {
    '1': [1, 2],
    '6': [3, 4],
}
kwargs = {
    'dict': dct,
    'size': size,
    'scale': size[0] / 10,
}
...
module(atoms, kwargs, foo)
module.run()
In another script, called parameter_search.py, I make the copy and change the parameters by running through each line in main.py, searching for what should be changed.
If the variable is found, a regex-command splits the line, and then changes the float (I'm not the best at regexes, so this could probably use some work):
import re
import fileinput
import os
from shutil import copyfile
def is_num(var):
    try:
        float(var)
        return True
    except ValueError:
        return False
def replace(filename, var_expr, new_val):
    found = False
    for line in fileinput.input(filename, inplace=1):
        if var_expr in line:
            if not found:
                found = True
                lst = re.split('(=|:|,)', line)
                for i, char in enumerate(lst):
                    if is_num(char):
                        lst[i] = str(new_val)
                line = "".join(lst)
            else:
                raise NameError(f'{var_expr} ambiguous')
        print(line, end="")
    if not found:
        raise NameError(f'{var_expr} not found in {filename}')
N = [10, 20]
alpha = [0.4, 0.6, 0.8]
foo = [bar(alpha=1), baz()]
for n in N:
    for a in alpha:
        for i, f in enumerate(foo):
            newfile = 'main'
            newfile += f'_n{n}'
            newfile += f'_a{a}'
            newfile += f'_f{i}'
            newfile += '.py'
            copyfile('main.py', newfile)
            replace(newfile, 'n=', n)
            replace(newfile, 'alpha=', a)
            replace(newfile, 'foo = ', f)
This gives decent results, but problems arise if several variables are on the same line, such as bar(label='C2H4', point=(1, 0)), if the variable is part of a dict such as kwargs, or if the parameter is a string, a function, or some other unusual value.
Is it possible to make something like replace() that is more general or in some other way makes this possible?

Okay, let me make a few assumptions here:
The parameter assignment happens only once in the .py script, and it is also the first occurrence of the parameter name.
There are exactly two ways to declare a parameter: as a single variable par = value or inside a dictionary {'par' : value} (with single or double quotes).
You can then use the re.sub function to directly substitute the value of the assignment, using what are called capture groups:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes of ASCII letters are
reserved for future use and treated as errors. Other unknown escapes
such as \& are left alone. Backreferences, such as \6, are replaced
with the substring matched by group 6 in the pattern.
source: https://docs.python.org/3/library/re.html
So, your function to set a new value for a given parameter name could look like this:
def replace_parameter_value_if_found(string, parameter, new_value):
    # Raw strings keep the regex backslashes intact; \g<1> is an unambiguous reference to group 1
    return re.sub(r"(['\"]?" + parameter + r"['\"]?\s*[=:]\s*)[.\w]*", r"\g<1>" + new_value, string)
Now, let's break it down:
"([\'\"]?"+parameter+"[\'\"]?\s*[=:]\s*)[.\w]*"
Stuff enclosed in () is called a capture group - it can be referenced later.
Stuff enclosed in [] matches any one of the characters inside (the backslash serves as an escape character)
? is a quantifier matching zero or one occurrence of what immediately precedes it
* is a quantifier matching any number of occurrences of what immediately precedes it, including zero
any literal string matches that string
\s and \w match any whitespace character and any word character (letters, digits, and underscore), respectively
So, say the parameter was 'alpha', the regex pattern becomes
"([\'\"]?alpha[\'\"]?\s*[=:]\s*)[\.\w]*"
and reads like this:
Open capture group 1
Match a single ' or " or neither (because of the ?)
Match the literal word alpha
Match a single ' or " or neither
Match any number of white space characters
Match a single = or :
Match any number of white space characters
Close capture group 1 (it now contains alpha= or 'alpha': or some variation of it)
Match any number of word characters or periods (this is what we will be replacing)
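All this will then be replaced with what's in capture group 1, followed by the new value. In the replacement, write the group reference as \g<1> in a raw string: a plain "\1" is read by Python as the control character \x01, and r"\1" followed by a value that starts with a digit (such as 0.8) would be parsed as a reference to group 10. Hence:
r"\g<1>" + new_value
The string can be the entirety of the .py script. Also keep in mind that what you are passing to the function are strings, and they can be whatever you want.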
Hope this helps.
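Here is a small runnable sketch of the approach under the two assumptions stated above. The function name replace_parameter_value is hypothetical, and the re.escape call, the \b word boundary, count=1, and the function-based replacement are additions for illustration, not part of the original answer.

```python
import re

QUOTE = r"""['"]?"""  # an optional single or double quote

def replace_parameter_value(source, parameter, new_value):
    # Group 1 keeps the (optionally quoted) name, the = or :, and spacing.
    pattern = ("(" + QUOTE + r"\b" + re.escape(parameter) + QUOTE
               + r"\s*[=:]\s*)[.\w]+")
    # A function replacement sidesteps backslash and group-number parsing.
    # count=1 reflects the assumption that the first occurrence is the one.
    return re.sub(pattern, lambda m: m.group(1) + str(new_value),
                  source, count=1)

script = "n = 4\nalpha = 0.5\nkwargs = {'scale': 10}\n"
script = replace_parameter_value(script, "alpha", 0.8)
script = replace_parameter_value(script, "scale", 2.5)
print(script)
```

Because the value part is matched with [.\w]+, this only handles simple values such as numbers or bare names; strings with spaces, tuples, and function calls would need a real parser (for example the ast module) rather than a regex.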

How can I replace part of a string with a pattern

For example, if the string is "abbacdeffel" and the pattern is "xyyx", each match should be replaced with "1234",
so "abbacdeffel" would become "1234cd1234l".
I have tried to think this through but I couldn't come up with anything. At first I thought a dictionary might help, but nothing came to mind.

What you're looking to do can be accomplished with regular expressions (regex), which let you describe a pattern and extract exactly the parts of a string that match it.
In your case, you want to match text shaped like abba, so use the following regex:
(\w)(\w)\2\1
https://regex101.com/r/hP8lA3/1
The pattern captures two word characters and then uses backreferences to require the second one again, followed by the first one again.
So implementing this in python code looks like this:
First, import Python's re module
import re
Then, declare your variable
text = "abbacdeffel"
re.finditer returns an iterator of match objects, so you can loop over every match
matches = re.finditer(r"(\w)(\w)\2\1", text)
Go through all the matches that the regexp found and replace the pattern with "1234"
for match in matches:
    text = text.replace(match.group(0), "1234")
For debugging:
print(text)
Complete Code:
import re
text = "abbacdeffel"
matches = re.finditer(r"(\w)(\w)\2\1", text)
for match in matches:
    text = text.replace(match.group(0), "1234")
print(text)
You can learn more about Regular Expressions here: https://regexone.com/references/python
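The loop over finditer plus str.replace can also be collapsed into a single re.sub call, which substitutes exactly the matched spans. This is a minimal sketch; the function name replace_abba is hypothetical.

```python
import re

def replace_abba(text, replacement="1234"):
    # (\w)(\w) capture two characters; \2\1 require them again, reversed.
    return re.sub(r"(\w)(\w)\2\1", replacement, text)

print(replace_abba("abbacdeffel"))  # 1234cd1234l
print(replace_abba("zzzz"))         # 1234
```

As the second call shows, this pattern does not require the two captured characters to differ, so a run such as zzzz is also treated as a match.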

New version of code (there was a bug):
def replace_with_pattern(pattern, line, replace):
    from collections import OrderedDict
    set_of_chars_in_pattern = set(pattern)
    indice_start_pattern = 0
    output_line = ""
    while indice_start_pattern < len(line):
        potential_end_pattern = indice_start_pattern + len(pattern)
        subline = line[indice_start_pattern:potential_end_pattern]
        print(subline)
        set_of_chars_in_subline = set(subline)
        if len(set_of_chars_in_subline) != len(set_of_chars_in_pattern):
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
            continue
        map_of_chars = OrderedDict()
        liste_of_chars_in_pattern = []
        for char in pattern:
            if char not in liste_of_chars_in_pattern:
                liste_of_chars_in_pattern.append(char)
        print(liste_of_chars_in_pattern)
        for subline_char in subline:
            if subline_char not in map_of_chars.values():
                map_of_chars[liste_of_chars_in_pattern.pop(0)] = subline_char
        print(map_of_chars)
        wanted_subline = ""
        for char_of_pattern in pattern:
            wanted_subline += map_of_chars[char_of_pattern]
        print("wanted_subline =" + wanted_subline)
        if subline == wanted_subline:
            output_line += replace
            indice_start_pattern += len(pattern)
        else:
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
    return output_line
Some tests:
test1 = replace_with_pattern("xyyx", "abbacdeffel", "1234")
test2 = replace_with_pattern("abbacdeffel", "abbacdeffel", "1234")
print(test1, test2)
=> 1234cd1234l 1234

Here goes my attempt:
([a-zA-Z])(?!\1)([a-zA-Z])\2\1
Assuming you want to match letters only (for other ranges, change both [a-zA-Z] as appropriate), we have:
([a-zA-Z])
Find the first character, and note it so we can later refer to it with \1.
(?!\1)
Check to see if the next character is not the same as the first, but without advancing the search pointer. This is to prevent aaaa being accepted. If aaaa is OK, just remove this subexpression.
([a-zA-Z])
Find the second character, and note it so we can later refer to it with \2.
\2\1
Now find the second again, then the first again, so we match the full abba pattern.
And finally, to do a replace operation, the full command would be:
import re
re.sub(r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1',
       '1234',
       'abbacdeffelzzzz')
The r at the start of the regex pattern prevents Python from processing the backslashes. Without it, you would need to do:
import re
re.sub('([a-zA-Z])(?!\\1)([a-zA-Z])\\2\\1',
       '1234',
       'abbacdeffelzzzz')
Now, I see the spec has expanded to a user-defined pattern; here is some code that will build that pattern:
import re
def make_re(pattern, charset):
    result = ''
    seen = []
    for c in pattern:
        # Is this a letter we've seen before?
        if c in seen:
            # Yes, so we want to match the captured pattern
            result += '\\' + str(seen.index(c)+1)
        else:
            # No, so match a new character from the charset,
            # but first exclude already matched characters
            for i in range(len(seen)):
                result += '(?!\\' + str(i + 1) + ')'
            result += '(' + charset + ')'
            # Note we have seen this letter
            seen.append(c)
    return result
print(re.sub(make_re('xzzx', '\\d'), 'abba', 'abba1221b99999889'))
print(re.sub(make_re('xyzxyz', '[a-z]'), '123123', 'abcabc zyxzyyx zyzzyz'))
Outputs:
abbaabbab9999abba
123123 zyxzyyx zyzzyz

regex search in python

I have a Python program that searches through a file for valid phone numbers according to a regex pattern. If it finds a match, it parses the number out and prints it on the screen. I want to modify it to recognize an extension if there is one. I added a second pattern (patStringExten), but I am unsure how to make it parse out the extension. Any help with this would be greatly appreciated!
import sys
import re
DEF_A_CODE = "None"
def usage() :
    print "Usage:"
    print "\t" + sys.argv[0] + " [<file>]"
def searchFile( fileName, pattern ) :
    fh = open( fileName, "r" )
    for l in fh :
        l = l.strip()
        # Here's the actual search
        match = pattern.search( l )
        if match :
            nr = match.groups()
            # Note, from the pattern, that 0 may be null, but 1 and 2 must exist
            if not nr[0] :
                aCode = DEF_A_CODE
            else :
                aCode = nr[0]
            print "area code: " + aCode + \
                ", exchange: " + nr[1] + ", trunk: " + nr[2]+ ", extension: " + nr[3]
        else :
            print "NO MATCH: " + l
    fh.close()
def main() :
    # stick filename
    if len( sys.argv ) < 2 : # no file name
        # assume telNrs.txt
        fileName = "telNrs.txt"
    else :
        fileName = sys.argv[1]
    # for legibility, Python supplies a 'verbose' pattern
    # requires a special flag
    #patString = '(\d{3})*[ .\-)]*(\d{3})[ .\-]*(\d{4})'
    patString = r'''
        # don't match beginning of string (takes care of 1-)
        (\d{3})? # area code (3 digits) (optional)
        [ .\-)]* # optional separator (any # of space, dash, or dot,
        # or closing ')' )
        (\d{3}) # exchange, 3 digits
        [ .\-]* # optional separator (any # of space, dash, or dot)
        (\d{4}) # number, 4 digits
        '''
    patStringExten = r'''
        # don't match beginning of string (takes care of 1-)
        (\d{3})? # area code (3 digits) (optional)
        [ .\-)]* # optional separator (any # of space, dash, or dot,
        # or closing ')' )
        (\d{3}) # exchange, 3 digits
        [ .\-]* # optional separator (any # of space, dash, or dot)
        (\d{4}) # number, 4 digits
        [ .\-x]*
        [0-9]{1,4}
        '''
    # Here is what the pattern would look like as a regular pattern:
    #patString = r'(\d{3})\D*(\d{3})\D*(\d{4})'
    # Instead of creating a temporary object each time, we will compile this
    # regexp once, and store this object
    pattern = re.compile( patString, re.VERBOSE )
    searchFile( fileName, pattern )
main()

I'm not sure what you're asking, but I'm going to take a guess.
First, your code is ignoring the new pattern you created. If you want to actually use that patStringExten pattern instead of the patString pattern, you have to pass it to the compile call:
pattern = re.compile(patStringExten, re.VERBOSE)
But if you do that, the matches still only have 3 groups, not 4. Why? Because you didn't put grouping parentheses around the extension. To fix that, just put them in: change [0-9]{1,4} to ([0-9]{1,4}).
And meanwhile, now you're only matching phone numbers with extensions, not both with and without. You could of course fix that by looping over the two patterns and doing the same thing for each, but it's probably better to merge them into one pattern, by making the last group optional. (You might want to make the last two lines, not just the last group, optional… but since the penultimate line is already a 0-or-more match, it's the same either way.) So, change that ([0-9]{1,4}) to ([0-9]{1,4})?.
Now your groups will have 4 elements instead of 3, so your existing code that tries to print nr[3] will print the extension (or None if the optional part was missing) instead of raising an IndexError.
But really, it's probably cleaner to rewrite the output with string formatting. For example:
if nr[3]:
    print "area code: {}, exchange: {}, trunk: {}, ext: {}".format(
        aCode, nr[1], nr[2], nr[3])
else:
    print "area code: {}, exchange: {}, trunk: {}".format(
        aCode, nr[1], nr[2])
Rather than show the whole thing put together in code, seeing the pattern on Debuggex seems more useful, so you can see how it works visually (try it against different strings, to make sure it matches everything you want the way you want it):
# don't match beginning of string (takes care of 1-)
(\d{3})? # area code (3 digits) (optional)
[ .\-)]* # optional separator (any # of space, dash, or dot,
# or closing ')' )
(\d{3}) # exchange, 3 digits
[ .\-]* # optional separator (any # of space, dash, or dot)
(\d{4}) # number, 4 digits
[ .\-x]*
([0-9]{1,4})?
Debuggex Demo
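For readers who want to try the merged pattern directly, here is a minimal sketch. It is written for Python 3, unlike the Python 2 code in the question, and the names PHONE_PATTERN and parse_phone are hypothetical; the pattern itself is the one shown above.

```python
import re

PHONE_PATTERN = re.compile(r'''
    (\d{3})?        # area code (3 digits, optional)
    [ .\-)]*        # optional separator (space, dot, dash, or closing paren)
    (\d{3})         # exchange, 3 digits
    [ .\-]*         # optional separator
    (\d{4})         # number, 4 digits
    [ .\-x]*        # optional separator before the extension
    ([0-9]{1,4})?   # extension (1 to 4 digits, optional)
''', re.VERBOSE)

def parse_phone(line):
    """Return (area, exchange, trunk, extension), or None if no number is found."""
    match = PHONE_PATTERN.search(line)
    if match is None:
        return None
    # Optional groups that did not take part in the match come back as None.
    return match.groups()

print(parse_phone("(800) 555-1212 x123"))  # ('800', '555', '1212', '123')
print(parse_phone("555-1212"))             # (None, '555', '1212', None)
```

Since groups() always returns four items here, the calling code can test the last one for None before printing the extension.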
