I want to use re.sub to change the format of phone numbers inside a string, but I am stuck on detecting the numbers.
I want to detect this format: ###-###-#### and change it to this one: (###)-###-####
My regex: (\d{3}\-)(\d{3}\-)(\d{4})$
My sub: (\1)-\2-\3
My regex detects the number only when it sits at the very end of the string. If the string ends like this: My number is 212-345-9999. then the number is not detected, because it is followed by another character. When I change my regex to (\d{3}\-)(\d{3}\-)(\d{4}) it also reformats a number like 123-456-78901, which is not something I want to detect as a phone number.
Help me
Just add the word boundary \b to the end of your regex pattern. It requires the last digit to be followed by a non-word character (space, period, etc.) or by the end of the string, which disallows any additional digits.
(\d{3}\-)(\d{3}\-)(\d{4})\b
But that will result in duplicate dashes, because the captured groups include the dash and the replacement adds its own. Instead, leave the dash - out of the captured groups so that it isn't duplicated in the resulting string. So use this:
(\d{3})\-(\d{3})\-(\d{4})\b
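A minimal sketch of how this pattern behaves with re.sub(), using the replacement string from the question. The sample sentences and the helper name reformat are made up for illustration.

```python
import re

# Pattern from the answer above: dashes sit outside the groups, \b closes the number.
PHONE_RE = re.compile(r"(\d{3})-(\d{3})-(\d{4})\b")


def reformat(text):
    # Hypothetical helper: rewrite ###-###-#### as (###)-###-####.
    return PHONE_RE.sub(r"(\1)-\2-\3", text)


print(reformat("My number is 212-345-9999."))  # My number is (212)-345-9999.
print(reformat("Not a phone: 123-456-78901"))  # left unchanged
```

Note that \b only guards the end of the number. A string such as 9123-456-7890 would still match from its second digit; putting another \b at the front of the pattern is one way to rule that out, if it matters for your data.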
If you want a stricter pattern that ensures the string contains only the indicated pattern and nothing more, anchor the match to the start and end of the string. Here, we optionally allow one trailing character \W, i.e. a non-word character (not a letter, digit, or underscore).
^(\d{3})\-(\d{3})\-(\d{4})\W?$
Just change \W? to \W* if you want to match an arbitrary number of trailing non-word characters, e.g. 123-456-7890.,
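To make the difference between \W? and \W* concrete, here is a small sketch that runs both anchored patterns over a few made-up inputs. The variable names single, many, and results are illustrative only.

```python
import re

single = re.compile(r"^(\d{3})-(\d{3})-(\d{4})\W?$")  # at most one trailing non-word character
many = re.compile(r"^(\d{3})-(\d{3})-(\d{4})\W*$")    # any number of trailing non-word characters

results = {}
for text in ["123-456-7890", "123-456-7890.", "123-456-7890.,", "123-456-7890.1"]:
    # Record whether each pattern accepts the input.
    results[text] = (bool(single.match(text)), bool(many.match(text)))
    print(text, results[text])
```

Only the input ending in ., is treated differently by the two patterns; the one ending in .1 is rejected by both because a digit follows the period.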
Sample Run:
If you intend to only process the correctly-formatted numbers, then don't call re.sub() right away. First, check if there is a match via re.match():
import re
number_re = re.compile(r"^(\d{3})\-(\d{3})\-(\d{4})\W?$")
for num in [
    "123-456-7890",
    "123-456-78901",
    "123-456-7890.",
    "123-456-7890.1",
]:
    print(num)
    if number_re.match(num):
        print("\t", number_re.sub(r"(\1)-\2-\3", num))
    else:
        print("\tIncorrect format")
Output:
123-456-7890
    (123)-456-7890
123-456-78901
    Incorrect format
123-456-7890.
    (123)-456-7890
123-456-7890.1
    Incorrect format
I have a list of IDs, and I need to check whether these IDs are properly formatted. The correct format is as follows:
[O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
The string can also be followed by a dash and a number. I have two problems with my code: 1) how do I limit the length of the string to exactly the number of characters specified by the search terms? and 2) how can I specify that there can be a "-[0-9]" following the string if it matches?
potential_uniprots=['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
import re
def is_uniprot(ID):
    status=False
    uniprot1=re.compile(r'\b[O,P,Q]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot2=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot3=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
        status=True
    return status
correctIDs=[]
for prot in potential_uniprots:
    if is_uniprot(prot) == True:
        correctIDs.append(prot)
print(correctIDs)
Expression Fixes:
BEFORE READING:
All credit for the expression fixes goes to The fourth bird's comment. Please see that comment here or under the original post:
You can omit {1} and the commas from the character class (if you don't want to match commas). The patterns by themselves do not contain a quantifier and have word boundaries. So between these word boundaries, you are already matching an exact number of characters. To match an optional hyphen and digit, you can use an optional non-capturing group (?:-[0-9])?
You don't need the , separating the characters in the square brackets, because a character class matches any one of the characters listed inside it. For example, a regex such as [A-Z,0-9] is going to match an uppercase character, a comma, or a digit, whereas a regex such as [A-Z0-9] is going to match an uppercase character or a digit. Furthermore, you don't need the {1}, as the regex will match exactly once by default if no quantifier is specified. This means that you can just delete the {1} from the expression.
Checking Length?
There is a simple way to do this without regex, which is as follows:
string = "Q08F88"
status = len(string) in (6, 10)  # the formats above are 6 or 10 characters long (without the optional -[0-9] suffix)
But you can also force the regex to match certain lengths using \b (word boundary), which you have already done. Alternatively, you can use ^ and $ at the beginning and end of the expression, respectively, to denote the beginning and end of the string.
Consider this expression: ^abcd$ (only match strings that contain abcd and nothing else)
This means that it is only going to match the string:
abcd
And not:
eabcd
abcde
This is because ^ denotes the start of the string and $ denotes the end of the string.
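As a quick check of the anchoring behaviour described above, the sketch below compares the anchored and unanchored forms on the three example strings. The dictionary name outcome is illustrative only.

```python
import re

outcome = {}
for text in ["abcd", "eabcd", "abcde"]:
    anchored = re.search(r"^abcd$", text) is not None   # whole string must be abcd
    unanchored = re.search(r"abcd", text) is not None   # abcd may appear anywhere
    outcome[text] = (anchored, unanchored)
    print(text, anchored, unanchored)
```

re.fullmatch() is a related option that requires the entire string to match without writing the anchors into the pattern.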
In the end, you're left with this first expression:
(^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](?:-[0-9])?$)
You can modify your other expressions easily as they follow the same structure as above.
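For reference, here is one way the whole check might look once all three formats from the question are rewritten the same way. This is a sketch, not the original poster's final code: merging the formats into a single alternation, shortening repeated classes with {3} and {2}, and using re.fullmatch() instead of ^ and $ are choices made here for brevity.

```python
import re

# The three formats from the question, followed by an optional "-digit" suffix.
UNIPROT_RE = re.compile(
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]"
    r"|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9][A-Z][A-Z0-9]{2}[0-9])"
    r"(?:-[0-9])?"
)


def is_uniprot(identifier):
    # fullmatch() succeeds only if the entire string fits the pattern.
    return UNIPROT_RE.fullmatch(identifier) is not None


potential_uniprots = ['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
correct_ids = [prot for prot in potential_uniprots if is_uniprot(prot)]
print(correct_ids)  # ['D4S359N116-2', 'Y6IT25', 'V5PG90']
```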
Code Suggestions
Your code looks great, but you could make a few minor fixes to improve readability and conventions. For example, you could change this:
if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
    status=True
return status
To this:
return bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
# -OR-
status = bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
return status
Because search() returns either a match object (which is truthy) or None (which is falsy), the or-expression already carries the answer. Wrapping it in bool() turns it into True or False, so it is safe to return that expression directly.
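To see what search() actually hands back, here is a short sketch using the first expression from above. The second input string is made up for illustration.

```python
import re

pattern = re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")

hit = pattern.search("Q08F88")    # a match object, which is truthy
miss = pattern.search("DFQME6")   # None, which is falsy

print(hit is not None, miss is None)
print(bool(hit), bool(miss))
```

Both values work in an if statement as they are; bool() is only needed when the caller expects a strict True or False.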
I am working on input validation and need a regex that accepts only two numbers, each at most 2 digits long, with a single white space between them.
Regex for Python
import re
pattern = "^[0-9_ ]{2}$"
check = "01 03"
a = re.match(pattern, check)
if a is None:
    print('Not valid value')
else:
    print("valid value")
The output I get is 'Not valid value'. What am I doing wrong here?
You're repeating a character set with {2}, which matches exactly two of the preceding token. There will only be a match if the string contains exactly two characters, and "01 03" has five.
Instead, use [0-9]{1,2} to match one or two digits, followed by a space, followed by that same pattern again:
[0-9]{1,2} [0-9]{1,2}$
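A minimal sketch of the validation with that pattern. The helper name is_valid_pair is hypothetical, and re.fullmatch() is used here in place of the trailing $ so that the whole input has to match.

```python
import re

# One or two digits, a single space, one or two digits.
PAIR_RE = re.compile(r"[0-9]{1,2} [0-9]{1,2}")


def is_valid_pair(value):
    # fullmatch() rejects anything before or after the pair.
    return PAIR_RE.fullmatch(value) is not None


for check in ["01 03", "1 2", "123 4", "01  03"]:
    print(repr(check), "valid value" if is_valid_pair(check) else "Not valid value")
```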
I want to match a string for which the string elements should contain specific characters only:
First character from [A,C,K,M,F]
Followed by a number (float or integer). Allowed instances: 1,2.5,3.6,9,0,6.3 etc.
Ending at either of these roman numerals [I, II, III, IV, V].
The regex that I am supplying is the following
bool(re.match(r'(A|C|K|M|F){1}\d+\.?\d?(I|II|III|IV|V)$', test_str))
The "(I|II|III|IV|V)" part will return true for test_str='C5.3IV', but I want it to be true even if two of the roman numerals are present at the same time with a delimiter /, i.e. the regex query should return true for test_str='C5.3IV/V' also.
How should I modify the regex?
Thanks
Try this:
bool(re.match(r'[ACKMF]\d+\.?\d?(I|II|III|IV|V)(/(I|II|III|IV|V))*$', test_str))
I also changed the start of your expression from (A|C|K|M|F){1} to [ACKMF]. Characters between square brackets form a character class. Such a class matches one character out of a range of options. You most commonly see them with ranges like [A-Z0-9] to match capital letters or digits, but you can also add individual characters, as I've done for your regex.
Group the delimiter and the roman numeral together and make the group optional, the same way the decimal point in the float / int is optional (you don't know whether or not it will appear). With ? the group can appear once at most; with * it can repeat, which also allows longer chains such as IV/V/I. Hope this helps!
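Here is a sketch of the "once at most" variant, which puts ? on the delimiter group. The names ROMAN, PATTERN, and is_valid are made up for this example, and non-capturing groups are used only because nothing is extracted.

```python
import re

ROMAN = r"(?:I|II|III|IV|V)"

# Letter, number with optional single decimal digit, numeral, optional "/numeral".
PATTERN = re.compile(r"[ACKMF]\d+\.?\d?" + ROMAN + r"(?:/" + ROMAN + r")?$")


def is_valid(test_str):
    return PATTERN.match(test_str) is not None


for s in ["C5.3IV", "C5.3IV/V", "A9III", "C5.3IV/V/I", "B5.3IV"]:
    print(s, is_valid(s))
```

Swapping the final ? for * would accept the chained C5.3IV/V/I as well.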
I'm trying to use regular expressions to modify the format of phone numbers in a list.
Here is a sample list:
["(123)456-7890 (321)-654-0987",
"(111) 111-1111",
"222-222-2222",
"(333)333.3333",
"(444).444.4444",
"blah blah blah (555) 555.5555",
"666.666.6666 random text"]
Every valid number has either a space OR start of string character leading, AND either a space OR end of string character trailing. This means that there can be random text in the strings, or multiple numbers on one line. My question is: How can I modify the format of ALL the phone numbers with my match pattern below?
I've written the following pattern to match all valid formats:
p = re.compile(r"""
(((?<=\ )|(?<=^)) #space or start of string
((\([0-9]{3}\))|[0-9]{3}) #Area code
(((-|\ )?[0-9]{3}-[0-9]{4}) #based on '-'
| #or
((\.|\ )?[0-9]{3}\.[0-9]{4})) #based on '.'
(?=(\ |$))) #space or end of string
""", re.X)
I want to modify the numbers so they adhere to the format:
\(\d{3}\)\d{3}-\d{4} #Ex: (123)456-7890
I tried using re.findall, and re.sub but had no luck. I'm confused on how to deal with the circumstance of there being multiple matches on a line.
EDIT: Desired output:
["(123)456-7890 (321)654-0987",
"(111)111-1111",
"(222)222-2222",
"(333)333-3333",
"(444)444-4444",
"blah blah blah (555)555-5555",
"(666)666-6666 random text"]
Here's a simpler solution that works for all of those cases, though it is a little naïve (it doesn't care about matching brackets).
\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})
Replace with:
(\1)\2-\3
Explanation:
Works by first checking for 3 digits, optionally surrounded by brackets on either side, with \(?(\d{3})\)?. Notice that the 3 digits are in a capturing group.
Next, it checks for an optional separator character (space, period, or hyphen), and then another 3 digits, also stored in a capturing group: [ .-]?(\d{3}). The hyphen is written last in the character class so that it is taken literally; written as [ -.], it would denote the range from space to period, which also admits characters such as ( and +.
And lastly, it does the previous step again, but with 4 digits instead of 3: [ .-]?(\d{4})
Python:
To use it in Python, you should just be able to iterate over each element in the list and do:
p.sub('(\\1)\\2-\\3', myString) # Note the double backslashes, or...
p.sub(r'(\1)\2-\3', myString) # Raw strings work too
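Putting it together, this sketch compiles the simple pattern and applies it to the sample list from the question. The separator class is written as [ .-] so that the hyphen is literal; the names PHONE_RE, numbers, and cleaned are illustrative.

```python
import re

# Optional brackets around the area code, optional space/period/hyphen separators.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})")

numbers = [
    "(123)456-7890 (321)-654-0987",
    "(111) 111-1111",
    "222-222-2222",
    "(333)333.3333",
    "(444).444.4444",
    "blah blah blah (555) 555.5555",
    "666.666.6666 random text",
]

# sub() replaces every match in a string, so several numbers on one line are all handled.
cleaned = [PHONE_RE.sub(r"(\1)\2-\3", line) for line in numbers]
for line in cleaned:
    print(line)
```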
EDIT
This solution is a bit more complex, and ensures that if there is a close bracket, there must be a start bracket.
(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ .-]?(\d{3})[ .-]?(\d{4})
Replace with:
(\2)\3-\4
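The conditional group (?(1)...|...) is supported by Python's re module, so the stricter pattern can be tried directly. The sketch below splits it over several lines for readability; the names STRICT_RE and normalize are hypothetical, and the separator class is written as [ .-] so the hyphen is literal.

```python
import re

STRICT_RE = re.compile(
    r"(\()?"                             # group 1: optional opening bracket
    r"((?(1)\d{3}(?=\))|\d{3}(?!\))))"   # group 2: area code; ")" must follow only if "(" was seen
    r"\)?[ .-]?(\d{3})[ .-]?(\d{4})"     # groups 3 and 4: the remaining digits
)


def normalize(text):
    return STRICT_RE.sub(r"(\2)\3-\4", text)


print(normalize("(123)456-7890"))  # already in the target format
print(normalize("123-456-7890"))   # (123)456-7890
print(normalize("123)456-7890"))   # stray closing bracket: left unchanged
```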
I have a string formatted as results_item12345. The numeric part is either four or five digits long. The letters will always be lowercase and there will always be an underscore somewhere in the non-numeric part.
I tried to extract it using the following:
import re
string = 'results_item12345'
re.search(r'[^a-z][\d]',string)
However, I only get the leftmost two digits. How can I get the entire number?
Assuming you only care about the numbers at the end of the string, the following expression matches 4 or 5 digits at the end of the string.
\d{4,5}$
Otherwise, the following would be the full regex matching the provided requirements.
^[a-z_]+\d{4,5}$
If you wanted to just match any number in the string you could search for:
r'[\d]{4,5}'
If you need validation of some sort you need to use:
r'^results_item[\d]{4,5}$'
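To compare the options above, here is a small sketch that extracts the number with a capturing group. The helper name trailing_number is made up, and re.fullmatch() stands in for the ^ and $ anchors.

```python
import re


def trailing_number(text):
    # Hypothetical helper: lowercase letters/underscores, then exactly 4 or 5 digits.
    m = re.fullmatch(r"[a-z_]+(\d{4,5})", text)
    return m.group(1) if m else None


print(trailing_number("results_item12345"))   # 12345
print(trailing_number("results_item1234"))    # 1234
print(trailing_number("results_item123456"))  # None: six digits is too many

# Searching only at the end does not validate the length of the number:
print(re.search(r"\d{4,5}$", "results_item123456").group())  # 23456
```

The last line shows why the end-anchored search alone is better suited to extraction than to validation: it happily returns the last five digits of a longer number.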
import re
a="results_item12345"
pattern=re.compile(r"(\D+)(\d+)")
x=pattern.match(a).groups()
print(x[1])