Python regex: Using Alternation for sets of words with delimiter - python

I want to match a string for which the string elements should contain specific characters only:
First character from [A,C,K,M,F]
Followed by a number (float or integer). Allowed instances: 1,2.5,3.6,9,0,6.3 etc.
Ending at either of these roman numerals [I, II, III, IV, V].
The regex that I am supplying is the following
bool(re.match(r'(A|C|K|M|F){1}\d+\.?\d?(I|II|III|IV|V)$', test_str))
The "(I|II|III|IV|V)" part returns true for test_str='C5.3IV', but I want it to be true even when two of the roman numerals are present at the same time, separated by the delimiter /; i.e. the regex query should also return true for test_str='C5.3IV/V'.
How should I modify the regex?
Thanks

Try this:
bool(re.match(r'[ACKMF]\d+\.?\d?(I|II|III|IV|V)(/(I|II|III|IV|V))*$', test_str))
I also changed the start of your expression from (A|C|K|M|F){1} to [ACKMF]. Characters between square brackets form a character class. Such a class matches exactly one character out of a set of options. You most commonly see them with ranges like [A-Z0-9] to match capital letters or digits, but you can also list individual characters, as I've done for your regex.

Group the delimiter and the roman numeral and treat it the same way you treat the decimal point in the float / int (you don't know whether or not it will appear but it will only appear once at most). Hope this helps!
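As a rough sketch of the two suggestions, the snippet below compares a pattern that allows any number of /-separated numerals with one that allows at most one. The names NUMERAL, MANY and AT_MOST_ONE are made up for this illustration, and the non-capturing groups (?:...) are a stylistic choice rather than part of the original answers.

```python
import re

# Hypothetical names, for illustration only.
NUMERAL = r"(?:I|II|III|IV|V)"

# Any number of "/numeral" suffixes (the * quantifier).
MANY = re.compile(r"[ACKMF]\d+\.?\d?" + NUMERAL + r"(?:/" + NUMERAL + r")*$")

# At most one "/numeral" suffix (the ? quantifier).
AT_MOST_ONE = re.compile(r"[ACKMF]\d+\.?\d?" + NUMERAL + r"(?:/" + NUMERAL + r")?$")

for test_str in ["C5.3IV", "C5.3IV/V", "C5.3IV/V/I", "B5.3IV"]:
    print(test_str, bool(MANY.match(test_str)), bool(AT_MOST_ONE.match(test_str)))
```

Which quantifier fits depends on whether more than two numerals should ever be accepted; the question only mentions two.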

Related

Regex for exactly phone number with any end

I want to use re.sub to change the phone number format inside a string, but I am stuck on detecting the number.
I want to detect this format: ###-###-#### and change it to this one: (###)-###-####
My regex: (\d{3}\-)(\d{3}\-)(\d{4})$
My sub: (\1)-\2-\3
My regex detects the number only when it ends the string. If the string is, for example: My number is 212-345-9999. then it fails, because the number is followed by another character. When I change my regex to (\d{3}\-)(\d{3}\-)(\d{4}), it also reformats numbers like 123-456-78901, which I do not want to detect as a phone number.
Help me
Just add the word boundary \b to the end of your regex pattern. It requires the last digit to be followed by a non-word character (space, period, etc.) or by the end of the string, which disallows any additional digits.
(\d{3}\-)(\d{3}\-)(\d{4})\b
But that will result in duplicate dashes. Instead, don't include the dash - in the captured groups, so that the dashes aren't duplicated in the resulting string. So use this:
(\d{3})\-(\d{3})\-(\d{4})\b
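A minimal sketch of this pattern used with re.sub, assuming the replacement string from the question. The names phone_re and reformat are hypothetical.

```python
import re

# Groups exclude the dashes; \b rejects a trailing extra digit.
phone_re = re.compile(r"(\d{3})-(\d{3})-(\d{4})\b")

def reformat(text):
    """Rewrite ###-###-#### as (###)-###-#### wherever it occurs in text."""
    return phone_re.sub(r"(\1)-\2-\3", text)

print(reformat("My number is 212-345-9999."))
print(reformat("123-456-78901"))
```

Note that \b only guards the end of the number. A number preceded directly by extra digits, such as 1123-456-7890, would still match from its second digit onward.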
If you want a stricter pattern that ensures the string contains the indicated pattern and nothing more, match the start and end of the string. Here, we optionally allow a single trailing character \W, which is anything other than a letter, digit, or underscore.
^(\d{3})\-(\d{3})\-(\d{4})\W?$
Just change \W? to \W* if you want to allow an arbitrary number of trailing non-word characters, e.g. 123-456-7890.,
Sample Run:
If you intend to only process the correctly-formatted numbers, then don't call re.sub() right away. First, check if there is a match via re.match():
import re

number_re = re.compile(r"^(\d{3})\-(\d{3})\-(\d{4})\W?$")

for num in [
    "123-456-7890",
    "123-456-78901",
    "123-456-7890.",
    "123-456-7890.1",
]:
    print(num)
    # Only reformat strings that match the strict pattern
    if number_re.match(num):
        print("\t", number_re.sub(r"(\1)-\2-\3", num))
    else:
        print("\tIncorrect format")
Output:
123-456-7890
         (123)-456-7890
123-456-78901
        Incorrect format
123-456-7890.
         (123)-456-7890
123-456-7890.1
        Incorrect format

Regex sub phone number format multiple times on same string

I'm trying to use regular expressions to modify the format of phone numbers in a list.
Here is a sample list:
["(123)456-7890 (321)-654-0987",
"(111) 111-1111",
"222-222-2222",
"(333)333.3333",
"(444).444.4444",
"blah blah blah (555) 555.5555",
"666.666.6666 random text"]
Every valid number has either a space OR start of string character leading, AND either a space OR end of string character trailing. This means that there can be random text in the strings, or multiple numbers on one line. My question is: How can I modify the format of ALL the phone numbers with my match pattern below?
I've written the following pattern to match all valid formats:
p = re.compile(r"""
(((?<=\ )|(?<=^)) #space or start of string
((\([0-9]{3}\))|[0-9]{3}) #Area code
(((-|\ )?[0-9]{3}-[0-9]{4}) #based on '-'
| #or
((\.|\ )?[0-9]{3}\.[0-9]{4})) #based on '.'
(?=(\ |$))) #space or end of string
""", re.X)
I want to modify the numbers so they adhere to the format:
\(\d{3}\)\d{3}-\d{4} #Ex: (123)456-7890
I tried using re.findall and re.sub but had no luck. I'm confused about how to handle multiple matches on a line.
EDIT: Desired output:
["(123)456-7890 (321)654-0987",
"(111)111-1111",
"(222)222-2222",
"(333)333-3333",
"(444)444-4444",
"blah blah blah (555)555-5555",
"(666)666-6666 random text"]
Here's a simpler solution that works for all of those cases, though it is a little naïve (it doesn't care about matching brackets).
\(?(\d{3})\)?[ -.]?(\d{3})[ -.]?(\d{4})
Replace with:
(\1)\2-\3
Explanation:
Works by first checking for 3 digits, and optionally surrounding brackets on either side, with \(?(\d{3})\)?. Notice that the 3 digits are in a capturing group.
Next, it checks for an optional separator character, and then another 3 digits, also stored in a capturing group: [ -.]?(\d{3}).
And lastly, it does the previous step again - but with 4 digits instead of 3: [ -.]?(\d{4})
Python:
To use it in Python, you should just be able to iterate over each element in the list and do:
p.sub('(\\1)\\2-\\3', myString) # Note the double backslashes, or...
p.sub(r'(\1)\2-\3', myString) # Raw strings work too
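A self-contained sketch of that loop over the sample list from the question. The names phone_re and numbers are hypothetical. The separator class is written here as [ .-] so that the hyphen is taken literally; in [ -.] the hyphen sits between two characters and defines a range from space to period, which happens to include the hyphen but also other punctuation.

```python
import re

# Optional brackets around the area code, optional separator, three groups.
phone_re = re.compile(r"\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})")

numbers = [
    "(123)456-7890 (321)-654-0987",
    "(111) 111-1111",
    "222-222-2222",
    "(333)333.3333",
    "(444).444.4444",
    "blah blah blah (555) 555.5555",
    "666.666.6666 random text",
]

# re.sub replaces every non-overlapping match, so several numbers
# on one line are all rewritten in a single call.
fixed = [phone_re.sub(r"(\1)\2-\3", s) for s in numbers]

for line in fixed:
    print(line)
```

This sketch does not enforce the space or start/end-of-string requirement from the question; the lookarounds from the original pattern could be added around it if that matters.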
EDIT
This solution is a bit more complex, and ensures that if there is a close bracket, there must be a start bracket.
(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ -.]?(\d{3})[ -.]?(\d{4})
Replace with:
(\2)\3-\4
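A short sketch of how the conditional pattern behaves in Python, whose re module supports the (?(1)yes|no) syntax. The names strict_re and normalize are hypothetical.

```python
import re

# (?(1)A|B): if group 1 (the opening bracket) matched, the area code must be
# followed by ")"; otherwise it must not be followed by ")".
strict_re = re.compile(
    r"(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ -.]?(\d{3})[ -.]?(\d{4})"
)

def normalize(text):
    # Group 1 is the optional "(", so the digits are groups 2, 3 and 4.
    return strict_re.sub(r"(\2)\3-\4", text)

print(normalize("(123)456-7890"))
print(normalize("123-456-7890"))
print(strict_re.search("123)456-7890"))  # close bracket without an open one
```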

regex to strict check numbers in string

Example strings:
I am a numeric string 75698
I am a alphanumeric string A14-B32-C7D
So far my regex works: (\S+)$
I want to add a way (probably look ahead) to check if the result generated by above regex contains any digit (0-9) one or more times?
This is not working: (\S+(?=\S*\d\S*))$
How should I do it?
A lookahead is not necessary for this; it is simply:
(\S*\d+\S*)
Here is a test case:
http://regexr.com?34s7v
Put the lookahead first and use the \D class instead of \S:
((?=\D*\d)\S+)$
Explanation: \D = [^\d]; in other words, it is everything that is not a digit.
You can be more explicit (better performance for your examples) with:
((?=[a-zA-Z-]*\d)[a-zA-Z\d-]+)$
and if you have only uppercase letters, you know what to do. (The smaller the class, the better the regex.)
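A small sketch of the lookahead-first pattern applied to the example strings from the question. The function name last_digit_token is hypothetical.

```python
import re

# The lookahead runs at the start of the candidate token: a digit must
# appear after any run of non-digits, then \S+ must reach the end.
pattern = re.compile(r"((?=\D*\d)\S+)$")

def last_digit_token(s):
    """Return the last whitespace-free token if it contains a digit, else None."""
    m = pattern.search(s)
    return m.group(1) if m else None

for s in [
    "I am a numeric string 75698",
    "I am a alphanumeric string A14-B32-C7D",
    "I am an alphabetic number: three",
]:
    print(repr(s), "->", last_digit_token(s))
```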
import re

text = '''
I am a numeric string 75698 \t
I am a alphanumeric string A14-B32-C7D
I am a alphanumeric string A14-B32-C74578
I am an alphabetic number: three
'''
# Capture the last token of each line when it contains at least one digit
regx = re.compile(r'\s(?=.*\d)([\da-zA-Z-]+)\s*$', re.MULTILINE)
print(regx.findall(text))
# result ['75698', 'A14-B32-C7D', 'A14-B32-C74578']
Note the presence of \s* in front of $ in order to catch alphanumeric portions that are separated by whitespace from the end of the line.

Regex in Python to get string of numbers after string of letters

I have a string formatted as results_item12345. The numeric part is either four or five digits long. The letters will always be lowercase and there will always be an underscore somewhere in the non-numeric part.
I tried to extract it using the following:
import re
string = 'results_item12345'
re.search(r'[^a-z][\d]',string)
However, I only get the leftmost two digits. How can I get the entire number?
Assuming you only care about the numbers at the end of the string, the following expression matches 4 or 5 digits at the end of the string.
\d{4,5}$
Otherwise, the following would be the full regex matching the provided requirements.
^[a-z_]+\d{4,5}$
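A minimal sketch of the full-string check that also extracts the number. The capturing group and the function name trailing_number are additions for illustration; re.fullmatch stands in for the ^...$ anchors.

```python
import re

def trailing_number(name):
    """Return the 4- or 5-digit suffix as a string, or None if name is malformed."""
    # Lowercase letters and underscores, then exactly 4 or 5 digits.
    m = re.fullmatch(r"[a-z_]+(\d{4,5})", name)
    return m.group(1) if m else None

print(trailing_number("results_item12345"))
print(trailing_number("results_item123"))
```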
If you wanted to just match any number in the string you could search for:
r'[\d]{4,5}'
If you need validation of some sort, you need to use:
r'^results_item[\d]{4,5}$'
import re

a = "results_item12345"
# Group 1: leading non-digits; group 2: the digits that follow
pattern = re.compile(r"(\D+)(\d+)")
x = pattern.match(a).groups()
print(x[1])

Python Regex to look for string

I have a text file with text that looks like below
Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10
}
}
I need to look for Num_row_labels >=10 in my text file. How do I do that using Python 3.2 regex?
Thanks.
Assuming that the data is formatted as above, and that there are no leading 0s in the number:
Num_row_labels=\d{2,}
A more liberal regex, which allows arbitrary spaces but still assumes no leading 0s:
Num_row_labels\s*=\s*\d{2,}
An even more liberal regex, which allows arbitrary spaces and leading 0s:
Num_row_labels\s*=\s*0*[1-9]\d+
If you need to capture the number, just surround \d{2,} (in the 1st and 2nd regex) or [1-9]\d+ (in the 3rd regex) with parentheses () and refer to it as the 1st capture group.
Use:
match = re.search(r"Num_row_labels=(\d+)", line)
The (\d+) matches at least one decimal digit (0-9) and captures all digits matched as a group (groups are stored in the object returned by re.search and re.match, which I'm assigning to match here). To access the group and compare it against 10, use:
if match and int(match.group(1)) >= 10:
    print("Num_row_labels is at least 10")
This will allow you to easily change the value of your threshold, unlike the answers that do everything in the regex. Additionally, I believe this is more readable in that it is very obvious that you are comparing a value against 10, rather than matching a nonzero digit in the regex followed by at least one other digit. What the code above does is ask for the 1st group that was matched (match.group(1) returns the string that was matched by \d+), and then, with the call to int(), converts the string to an integer. The integer returned by int() is then compared against 10.
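Putting those pieces together, a sketch that scans text line by line and keeps the lines whose value meets a threshold. The sample text is the snippet from the question; the function name lines_at_or_above is hypothetical.

```python
import re

text = '''Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10
}
}'''

def lines_at_or_above(text, threshold):
    """Return the lines whose Num_row_labels value is >= threshold."""
    hits = []
    for line in text.splitlines():
        match = re.search(r"Num_row_labels=(\d+)", line)
        # match is None on lines without the key, so test it first
        if match and int(match.group(1)) >= threshold:
            hits.append(line)
    return hits

print(lines_at_or_above(text, 10))
```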
The regex is Num_row_labels=[1-9][0-9]{1}.*
Now you can use the re python module (take a look here) to analyze your text and extract those
The regex looks like:
Num_row_labels=[0-9]*[1-9][0-9]+
Example of usage:
if re.search('Num_row_labels=[0-9]*[1-9][0-9]+', line):
    print(line)
The regular expression [0-9]*[1-9][0-9]+ means that the string must contain at least
one digit from 1 to 9 ([1-9]; in regular expressions a character class [] matches any one symbol from the range specified in the brackets);
and at least one digit from 0 to 9, possibly more ([0-9]+; the + sign means that the symbol or expression before it is repeated 1 or more times).
Any other digits may come before these ([0-9]* means any digit, 0 or more times). Once you have those two digits, any digits in front of them do not matter: the number is greater than or equal to 10 anyway.
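A quick sketch that checks this regex-only approach against a few values, including ones with leading zeros. The function name is_at_least_ten is hypothetical.

```python
import re

at_least_ten = re.compile(r"Num_row_labels=[0-9]*[1-9][0-9]+")

def is_at_least_ten(value):
    """True if 'Num_row_labels=<value>' is matched by the regex."""
    return bool(at_least_ten.search("Num_row_labels=" + value))

for value in ["9", "09", "10", "010", "100"]:
    print(value, is_at_least_ten(value))
```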
