regex search in python - python

I have a Python program that searches through a file for valid phone numbers according to a regex pattern. If it finds a match, it parses the number out and prints it on the screen. I want to modify it to recognize an extension if there is one. I added a second pattern (patStringExten), but I am unsure how to make it parse out the extension. Any help with this would be greatly appreciated!
import sys
import re
DEF_A_CODE = "None"
def usage() :
    print "Usage:"
    print "\t" + sys.argv[0] + " [<file>]"
def searchFile( fileName, pattern ) :
    fh = open( fileName, "r" )
    for l in fh :
        l = l.strip()
        # Here's the actual search
        match = pattern.search( l )
        if match :
            nr = match.groups()
            # Note, from the pattern, that 0 may be null, but 1 and 2 must exist
            if not nr[0] :
                aCode = DEF_A_CODE
            else :
                aCode = nr[0]
            print "area code: " + aCode + \
                ", exchange: " + nr[1] + ", trunk: " + nr[2]+ ", extension: " + nr[3]
        else :
            print "NO MATCH: " + l
    fh.close()
def main() :
    # stick filename
    if len( sys.argv ) < 2 : # no file name
        # assume telNrs.txt
        fileName = "telNrs.txt"
    else :
        fileName = sys.argv[1]
    # for legibility, Python supplies a 'verbose' pattern
    # requires a special flag
    #patString = '(\d{3})*[ .\-)]*(\d{3})[ .\-]*(\d{4})'
    patString = r'''
        # don't match beginning of string (takes care of 1-)
        (\d{3})? # area code (3 digits) (optional)
        [ .\-)]* # optional separator (any # of space, dash, or dot,
        # or closing ')' )
        (\d{3}) # exchange, 3 digits
        [ .\-]* # optional separator (any # of space, dash, or dot)
        (\d{4}) # number, 4 digits
        '''
    patStringExten = r'''
        # don't match beginning of string (takes care of 1-)
        (\d{3})? # area code (3 digits) (optional)
        [ .\-)]* # optional separator (any # of space, dash, or dot,
        # or closing ')' )
        (\d{3}) # exchange, 3 digits
        [ .\-]* # optional separator (any # of space, dash, or dot)
        (\d{4}) # number, 4 digits
        [ .\-x]*
        [0-9]{1,4}
        '''
    # Here is what the pattern would look like as a regular pattern:
    #patString = r'(\d{3})\D*(\d{3})\D*(\d{4})'
    # Instead of creating a temporary object each time, we will compile this
    # regexp once, and store this object
    pattern = re.compile( patString, re.VERBOSE )
    searchFile( fileName, pattern )
main()

I'm not sure what you're asking, but I'm going to take a guess.
First, your code is ignoring the new pattern you created. If you want to actually use that patStringExten pattern instead of the patString pattern, you have to pass it to the compile call:
pattern = re.compile(patStringExten, re.VERBOSE)
But if you do that, the matches still only have 3 groups, not 4. Why? Because you didn't put grouping parentheses around the extension. To fix that, just put them in: change [0-9]{1,4} to ([0-9]{1,4}).
And meanwhile, now you're only matching phone numbers with extensions, not both with and without. You could of course fix that by looping over the two patterns and doing the same thing for each, but it's probably better to merge them into one pattern, by making the last group optional. (You might want to make the last two lines, not just the last group, optional… but since the penultimate line is already a 0-or-more match, it's the same either way.) So, change that ([0-9]{1,4}) to ([0-9]{1,4})?.
Now your groups will have 4 elements instead of 3, so your existing code that tries to print nr[3] will print the extension (or None if the optional part was missing) instead of raising an IndexError.
But really, it's probably cleaner to rewrite the output with string formatting. For example:
if nr[3]:
    print "area code: {}, exchange: {}, trunk: {}, ext: {}".format(
        aCode, nr[1], nr[2], nr[3])
else:
    print "area code: {}, exchange: {}, trunk: {}".format(
        aCode, nr[1], nr[2])
Rather than show the whole thing put together in code, seeing the pattern on Debuggex seems more useful, so you can see how it works visually (try it against different strings, to make sure it matches everything you want the way you want it):
# don't match beginning of string (takes care of 1-)
(\d{3})? # area code (3 digits) (optional)
[ .\-)]* # optional separator (any # of space, dash, or dot,
# or closing ')' )
(\d{3}) # exchange, 3 digits
[ .\-]* # optional separator (any # of space, dash, or dot)
(\d{4}) # number, 4 digits
[ .\-x]*
([0-9]{1,4})?
Debuggex Demo
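If you would rather try the merged pattern directly in Python, here is a minimal sketch. It assumes Python 3 syntax (the question uses Python 2 print statements), and the names PHONE_PATTERN and parse_phone are made up for illustration; they are not part of the original program.

```python
import re

# Merged pattern from the answer: the extension group is optional,
# so numbers with and without an extension both match.
PHONE_PATTERN = re.compile(r'''
    (\d{3})?        # area code (3 digits, optional)
    [ .\-)]*        # optional separator (space, dot, dash, or closing paren)
    (\d{3})         # exchange, 3 digits
    [ .\-]*         # optional separator
    (\d{4})         # number, 4 digits
    [ .\-x]*        # optional separator before the extension
    ([0-9]{1,4})?   # extension, 1 to 4 digits (optional)
''', re.VERBOSE)


def parse_phone(line):
    """Hypothetical helper: return the 4 groups as a tuple, or None if no match."""
    match = PHONE_PATTERN.search(line)
    return match.groups() if match else None


print(parse_phone("555-123-4567 x89"))  # ('555', '123', '4567', '89')
print(parse_phone("123-4567"))          # (None, '123', '4567', None)
print(parse_phone("no digits here"))    # None
```

Groups that did not take part in the match come back as None, which is why the printing code has to check nr[0] and nr[3] before concatenating them.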

Related

regex: pattern fails to match what I am looking for

I have the following code that tries to retrieve the name of a file from a directory based on a double \ character:
import re
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=*\\\\)*'
re.findall(pattern,string)
The reasoning is that the name of the file always comes after a double \ , so I try to find any string that is preceded by text ending with \ .
Nevertheless, when I run this code I get the following error:
error: nothing to repeat at position 4
What am I doing wrong?
Edit: The concrete output I am looking for is getting the string 'abo_st_gas_dtd_csv' as a match.
There's a couple of things going on:
You need to declare your string definition using the same r'string' notation as for the pattern; right now your string only has a single backslash, since the first one of the two is escaped.
I'm not sure you're using * correctly. It means "repeat the immediately preceding element zero or more times", not "any string" (as in the usual shell patterns). The first * in the parentheses has nothing preceding it, so the regex is invalid; hence the error you see. I think what you want is .*, i.e., any character repeated 0 or more times. Furthermore, it is not needed inside the lookbehind (and would not be allowed there, since Python lookbehinds must have a fixed width). A more correct regexp would be r'(?<=\\\\).*':
import re
string = r'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=\\\\).*'
re.findall(pattern,string)
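To make the raw-string point concrete, here is a small sketch on a shortened version of the path. The variable names plain and raw are only for illustration.

```python
import re

plain = 'FO_PRE\\abo_st_gas_dtd.csv'   # normal literal: the string holds ONE backslash
raw = r'FO_PRE\\abo_st_gas_dtd.csv'    # raw literal: the string holds TWO backslashes

print(len(raw) - len(plain))  # 1

# A lookbehind for two backslashes finds nothing in the one-backslash string ...
print(re.findall(r'(?<=\\\\).*', plain))  # []
# ... but a lookbehind for a single backslash does.
print(re.findall(r'(?<=\\).*', plain))    # ['abo_st_gas_dtd.csv']
# With the raw literal, the two-backslash lookbehind works as intended.
print(re.findall(r'(?<=\\\\).*', raw))    # ['abo_st_gas_dtd.csv']
```

So either keep the original string and look behind for one backslash, or make the string raw and look behind for two.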
Your pattern consists only of a lookbehind, which is a zero-width assertion and consumes no text by itself. I would use this re.findall approach:
import re
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
filename = re.findall(r'\\([^.]+\.\w+)$', string)[0]
print(filename) # abo_st_gas_dtd.csv
files = 'I:E\\trm.csvest/PZMALIo4\ETRM841_FX_.csvDeals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
counter = -1
my_files = []
for f in files:
    counter += 1
    if ord(f) == 92:#'\'
        temp = files[counter+1:len(files)]
        temp_file = ""
        for f1 in temp:
            temp_file += f1
            # [0-len(temp_file)] => if [char after . to num index of type file]== csv
            if f1 == '.' and temp[len(temp_file):len(temp_file)+3] == "csv":
                my_files.append(temp_file + "csv")
                break
print(my_files)#['trm.csv', 'ETRM841_FX_.csv', 'abo_st_gas_dtd.csv']

Python Regex Check for Credit Cards

I created a script that checks specific card numbers, which are stored in the list credit_cards. It is a simple function that marks each one as Valid/Invalid depending on the regex patterns listed.
My problem is understanding how to extend this regex check to allow spaces and periods, so that a card written as 3423.3423.2343.3433 or 3323 3223 442 3234 is handled too. I already check for hyphens as a delimiter, but how can I also allow other delimiters, like periods and spaces?
Here is my script:
import re
credit_cards = ['6011346728478930','5465625879615786','5711424424442444',
    '5000-2368-7954-3214', '4444444444444444', '5331625879615786', '5770625879615786',
    '5750625879615786', '575455879615786']
def World_BINS(credit_cards):
    valid_BINS = r"^5(?:465|331|000|[0-9]{2})(-?\d{4}){3}$"
    do_not_repeat = r"((\d)-?(?!(-?\2){3})){16}"
    filters = valid_BINS, do_not_repeat
    for num in credit_cards:
        if all(re.match(f, num) for f in filters):
            print(f"{num} is Valid")
        else:
            print (f"{num} is invalid")
World_BINS(credit_cards)
You can use
import re
credit_cards = ['5000 2368 7954 3214','5000.2368.7954.3214','6011346728478930','5465625879615786', '5711424424442444', '5000-2368-7954-3214', '4444444444444444', '5331625879615786', '5770625879615786','5750625879615786', '575455879615786']
def World_BINS(credit_cards):
    valid_BINS = r"^5(?:465|331|000|[0-9]{2})(?=([\s.-]?))(\1\d{4}){3}$"
    do_not_repeat = r"^((\d)([\s.-]?)(?!(\3?\2){3})){16}$"
    filters = [valid_BINS, do_not_repeat]
    for num in credit_cards:
        if all(re.match(f, num) for f in filters):
            print(f"{num} is Valid")
        else:
            print (f"{num} is invalid")
World_BINS(credit_cards)
See the Python demo.
The (?=([\s.-]?))(\1\d{4}){3} part of the first regex captures an optional whitespace (\s), . or - into Group 1, and then \1 in the next group refers to that value (an empty string, a whitespace, a . or a -). The lookahead is used to make sure the delimiters / separators are used consistently in the string.
In the second regex, ^((\d)([\s.-]?)(?!(\3?\2){3})){16}$, a similar technique is used: the whitespace, . or - is captured into Group 3, and that char is optionally matched inside the subsequent quantified group to refer to the same value.
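To see the "capture the delimiter in a lookahead, then reuse it with a backreference" trick in isolation, here is a stripped-down sketch. The pattern is deliberately simpler than the ones above (it only checks four groups of four digits), and the names CONSISTENT and has_consistent_delimiters are invented for this example.

```python
import re

# Hypothetical, simplified pattern: four groups of four digits. Whatever
# separator follows the first group (space, dot, dash, or nothing) is
# captured into Group 1 and must be repeated before every later group.
CONSISTENT = re.compile(r"^\d{4}(?=([\s.-]?))(?:\1\d{4}){3}$")


def has_consistent_delimiters(number):
    """Return True if the number uses one separator style throughout."""
    return CONSISTENT.match(number) is not None


print(has_consistent_delimiters("5000 2368 7954 3214"))  # True
print(has_consistent_delimiters("5000.2368.7954.3214"))  # True
print(has_consistent_delimiters("5000236879543214"))     # True
print(has_consistent_delimiters("5000-2368.7954 3214"))  # False
```

The last call fails because Group 1 captured "-", so the "." before 7954 cannot satisfy \1.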

How to remove certain characters in a string without using replace or string.split() in python

I am using a file name as a dictionary key.
key_dict = {"employee_details.csv": ["name", "address", "phone"]}
This string is retrieved from an S3 bucket in AWS.
Now the ETL produces the files with their version number added, e.g. "employee_details_0_0_0.csv".
The "_0_0_0" part of the string is a dynamic value which could change in the future, say to employee_details_1_2_3.csv.
How can I get the file name without the version added?
Input
input 1 : employee_details_0_0_0.csv
input 2 : employee_details_1_2_3.csv
Output
output 1 : employee_details.csv
output 2 : employee_details.csv
Replace r"(_\d+){3}(?=\.csv)" with "" (see demo)
>>> import re
>>> filename = "employee_details_0_0_0.csv"
>>> re.sub(r"(_\d+){3}(?=\.csv)", "", filename)
'employee_details.csv'
Regex explanation:
(...) is a capturing group
_ matches literally an underscore
\d is a shorthand character class which matches a digit
{3} repeats the preceding group exactly three times
(?=...) is a positive lookahead, meaning "followed by..."
\. is an escaped period and matches literally a period
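Tying this back to the dictionary lookup in the question, a minimal sketch could look like the following. The helper name base_name and the constant VERSION_SUFFIX are assumptions made for the example, and the pattern still assumes exactly three numeric version parts before .csv.

```python
import re

# Same pattern as above: three "_<digits>" parts directly before ".csv".
VERSION_SUFFIX = re.compile(r"(_\d+){3}(?=\.csv)")

key_dict = {"employee_details.csv": ["name", "address", "phone"]}


def base_name(filename):
    """Hypothetical helper: strip the version suffix, if there is one."""
    return VERSION_SUFFIX.sub("", filename)


print(base_name("employee_details_0_0_0.csv"))            # employee_details.csv
print(key_dict[base_name("employee_details_1_2_3.csv")])  # ['name', 'address', 'phone']
```

A file name without a version suffix passes through unchanged, so the same lookup works for both old and new files.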
Considering the input string, we can assume that string parts representing the file names are alphabetical and versioning parts are numerical.
You may use the filter function, together with os.path.splitext to split off the extension:
from os import path
def clear_name(recived_name:str)-> str:
    is_not_number = lambda string: not string.replace(".","").isnumeric()
    name , ext = path.splitext(recived_name)
    return "_".join(filter(is_not_number, name.split("_")) ) + ext
If you apply both inputs, the result will be
clear_name("employee_details_0_0_0.csv")
output: 'employee_details.csv'
clear_name("employee_details_1_2_3.csv")
output: 'employee_details.csv'
This function also removes version parts that contain dots:
clear_name("employee_details_0.1.0.csv")
output: 'employee_details.csv'
To learn more, see the documentation for os.path and for the built-in filter function.

Change Python script parameters from another Python script

I have a main-script from which I want to make a couple of temporary copies, with slight changes in each of the copies.
main.py could look like this:
import numpy as np
import module
import bar
...
foo = bar(label='C2H4', point=(1,0))
atoms = 'H4'
template = read('template.t')
size = template.lengths()
n = 4
alpha = 0.5
batch_size = 256 // (n * alpha)
dct = {
    '1': [1, 2],
    '6': [3, 4],
}
kwargs = {
    'dict': dct,
    'size': size,
    'scale': size[0] / 10,
}
...
module(atoms, kwargs, foo)
module.run()
In another script, called parameter_search.py, I make the copy and change the parameters by running through each line in main.py, searching for what should be changed.
If the variable is found, a regex-command splits the line, and then changes the float (I'm not the best at regexes, so this could probably use some work):
import re
import fileinput
import os
from shutil import copyfile
def is_num(var):
    try:
        float(var)
        return True
    except ValueError:
        return False
def replace(filename, var_expr, new_val):
    found = False
    for line in fileinput.input(filename, inplace=1):
        if var_expr in line:
            if not found:
                found = True
                lst = re.split('(=|:|,)', line)
                for i, char in enumerate(lst):
                    if is_num(char):
                        lst[i] = str(new_val)
                line = "".join(lst)
            else:
                raise NameError(f'{var_expr} ambiguous')
        print(line, end="")
    if not found:
        raise NameError(f'{var_expr} not found in {filename}')
N = [10, 20]
alpha = [0.4, 0.6, 0.8]
foo = [bar(alpha=1), baz()]
for n in N:
    for a in alpha:
        for i, f in enumerate(foo):
            newfile = 'main'
            newfile += f'_n{n}'
            newfile += f'_a{a}'
            newfile += f'_f{i}'
            newfile += '.py'
            copyfile('main.py', newfile)
            replace(newfile, 'n=', n)
            replace(newfile, 'alpha=', a)
            replace(newfile, 'foo = ', f)
This gives decent results, but problems arise if several variables are on the same line, such as bar(label='C2H4', point=(1, 0)) or the variable is a part of a dict as kwargs, the parameter is a string, a function or some other weird variable.
Is it possible to make something like replace() that is more general or in some other way makes this possible?
Okay, let me make a few assumptions here:
The parameter assignment happens only once in the .py script, and it is also the first occurrence of the parameter name.
There are exactly two ways to declare a parameter: as a single variable par = value or inside a dictionary {'par' : value} (with single or double quotes).
You can then use the re.sub function to directly substitute the value of the assignment, using what is called capture groups:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes of ASCII letters are
reserved for future use and treated as errors. Other unknown escapes
such as \& are left alone. Backreferences, such as \6, are replaced
with the substring matched by group 6 in the pattern.
source: https://docs.python.org/3/library/re.html
So, your function to set a new value for a given parameter name could look like this:
def replace_parameter_value_if_found(string, parameter, new_value):
    return re.sub("([\'\"]?"+parameter+"[\'\"]?\s*[=:]\s*)[\.\w]*", r"\1"+new_value, string)
Now, let's break it down:
"([\'\"]?"+parameter+"[\'\"]?\s*[=:]\s*)[.\w]*"
Stuff enclosed in () is called a capture group - it can be
referenced later.
Stuff enclosed in [] matches any of those characters inside (the backslash serves as an escape character)
? is a quantifier matching zero or one occurrence of the element immediately before it
* is a quantifier matching any number of occurrences of the element immediately before it, including zero
any literal string matches that string
\s and \w match any whitespace character and any word character (letters, digits, and underscore), respectively
So, say the parameter was 'alpha', the regex pattern becomes
"([\'\"]?alpha[\'\"]?\s*[=:]\s*)[\.\w]*"
and reads like this:
Open capture group 1
Match a single ' or " or neither (because of the ?)
Match the literal word alpha
Match a single ' or " or neither
Match any number of white space characters
Match a single = or :
Match any number of white space characters
Close capture group 1 (it now contains alpha= or 'alpha': or some variation of it)
Match any number of word characters or periods (this is what we will be replacing)
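All this will then be replaced with what's in capture group 1, followed by the new value. The backreference has to be written as a raw string (or as "\\1"), because in an ordinary string literal "\1" is the control character with code 1 rather than a reference to group 1; new_value must also be a string:
r"\1"+new_value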
The string can be the entirety of the .py script, also keep in mind that what you are passing to the function are strings and they can be whatever you want.
Hope this helps.
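For what it's worth, here is a slightly more defensive sketch of the same idea. It is only an illustration under the two assumptions listed above: the name set_parameter is made up, and the value being replaced must consist of word characters and periods (numbers, bare names), so quoted strings, negative numbers, and function calls are not handled. It uses a function as the replacement, because a replacement string such as r"\1" + "10" becomes r"\110", which is no longer read as "group 1 followed by 10".

```python
import re


def set_parameter(source, parameter, new_value):
    """Hypothetical helper: replace the first value assigned to `parameter`.

    Handles `par = value` and `'par': value`; raises NameError if not found.
    """
    pattern = re.compile(
        # Group 1: optional quote, the exact name, optional quote, then = or :
        r"(['\"]?\b" + re.escape(parameter) + r"\b['\"]?\s*[=:]\s*)"
        # The old value: one or more word characters or periods
        r"[.\w]+"
    )
    # A function replacement avoids escape processing of the new value.
    new_source, count = pattern.subn(
        lambda m: m.group(1) + str(new_value), source, count=1
    )
    if count == 0:
        raise NameError(f"{parameter} not found")
    return new_source


script = "n = 4\nalpha = 0.5\nkwargs = {'scale': 3}\n"
script = set_parameter(script, "n", 10)
script = set_parameter(script, "alpha", 0.8)
script = set_parameter(script, "scale", 7)
print(script)
```

The \b word boundaries keep a short name like n from matching inside a longer identifier, and re.escape protects against parameter names that contain regex metacharacters.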

How to make a regular expression 'greedy but optional'

I'm trying to write a parser for a string which represents a file path, optionally followed by a colon (:) and a string representing access flags (e.g. r+ or w). The file name can itself contain colons, e.g., foo:bar.txt, so the colon separating the access flags should be the last colon in the string.
Here is my implementation so far:
import re
def parse(string):
    SCHEME = r"file://" # File prefix
    PATH_PATTERN = r"(?P<path>.+)" # One or more of any character
    FLAGS_PATTERN = r"(?P<flags>.+)" # The letters r, w, a, b, a '+' symbol, or any digit
    # FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + r":" + FLAGS_PATTERN + r"$" # This makes the first test pass, but the second one fail
    FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + optional(r":" + FLAGS_PATTERN) + r"$" # This makes the second test pass, but the first one fail
    tokens = re.match(FILE_RESOURCE_PATTERN, string).groupdict()
    return tokens['path'], tokens['flags']
def optional(re):
    '''Encloses the given regular expression in a group which matches 0 or 1 repetitions.'''
    return '({})?'.format(re)
I've tried the following tests:
import pytest
def test_parse_file_with_colon_in_file_name():
    assert parse("file://foo:bar.txt:r+") == ("foo:bar.txt", "r+")
def test_parse_file_without_acesss_flags():
    assert parse("file://foobar.txt") == ("foobar.txt", None)
if __name__ == "__main__":
    pytest.main([__file__])
The problem is that by either using or not using optional, I can make one or the other test pass, but not both. If I make r":" + FLAGS_PATTERN optional, then preceding regular expression consumes the entire string.
How can I adapt the parse method to make both tests pass?
You should build the regex like
^file://(?P<path>.+?)(:(?P<flags>[^:]+))?$
See the regex demo.
In your code, the ^ anchor is not necessary, because re.match already anchors the match at the start of the string. The path group matches one or more characters lazily, so any text that the optional second group can match lands in the flags capture: the path extends only up to the first : that is followed by one or more characters other than : and then the end of the string (that is, the last : in the string). Thanks to the $ anchor, the path group matches the whole rest of the string when the optional group does not match.
Use the following fix:
PATH_PATTERN = r"(?P<path>.+?)" # One or more of any character, as few as possible
FLAGS_PATTERN = r"(?P<flags>[^:]+)" # One or more characters other than ':'
See the online Python demo.
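Put together, the suggested regex could be used roughly like this. This is a sketch rather than a drop-in replacement: the constant name FILE_RESOURCE and the ValueError for non-matching input are additions for the example, and the optional part is written as a non-capturing group.

```python
import re

# Lazy path, then an optional ":flags" part where flags contain no colon.
FILE_RESOURCE = re.compile(r"file://(?P<path>.+?)(?::(?P<flags>[^:]+))?$")


def parse(string):
    """Return (path, flags); flags is None when no access flags are given."""
    match = FILE_RESOURCE.match(string)
    if match is None:
        raise ValueError(f"not a file resource: {string!r}")
    return match.group("path"), match.group("flags")


print(parse("file://foo:bar.txt:r+"))  # ('foo:bar.txt', 'r+')
print(parse("file://foobar.txt"))      # ('foobar.txt', None)
```

Both tests from the question pass with this version, because the lazy path gives up characters only as far as needed for the optional flags group and the $ anchor to succeed.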
Just for fun, I wrote this parse function, which I think is better than using RE?
def parse(string):
    s = string.split('//')[-1]
    try:
        path, flags = s.rsplit(':', 1)
    except ValueError:
        path, flags = s.rsplit(':', 1)[0], None
    return path, flags
