Example strings:
I am a numeric string 75698
I am a alphanumeric string A14-B32-C7D
So far my regex works: (\S+)$
I want to add a check (probably a lookahead) that the result captured by the above regex contains at least one digit (0-9).
This is not working: (\S+(?=\S*\d\S*))$
How should I do it?
A lookahead is not necessary for this; it is simply:
(\S*\d+\S*)
Here is a test case:
http://regexr.com?34s7v
Reorder it so the lookahead comes first, and use the \D class instead of \S:
((?=\D*\d)\S+)$
explanation: \D = [^\d] in other words it is all that is not a digit.
You can be more explicit (with better performance for your examples) with:
((?=[a-zA-Z-]*\d)[a-zA-Z\d-]+)$
And if you have only uppercase letters, you know what to do (the smaller the class, the better the regex).
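As a quick check, the lookahead-first pattern can be tried on the example strings in Python. This is only a minimal sketch; the helper name last_token_with_digit is made up for illustration and does not come from the answers.

```python
import re

# Lookahead first: the token must contain a digit, and it must end the string.
PATTERN = re.compile(r'((?=\D*\d)\S+)$')

def last_token_with_digit(line):
    """Return the final whitespace-delimited token if it contains a digit, else None."""
    match = PATTERN.search(line)
    return match.group(1) if match else None

for line in ('I am a numeric string 75698',
             'I am a alphanumeric string A14-B32-C7D',
             'I am an alphabetic number: three'):
    print(repr(line), '->', last_token_with_digit(line))
```

A line with trailing whitespace would not match here, because $ has to follow the token directly.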
text = '''
I am a numeric string 75698 \t
I am a alphanumeric string A14-B32-C7D
I am a alphanumeric string A14-B32-C74578
I am an alphabetic number: three
'''
import re
regx = re.compile(r'\s(?=.*\d)([\da-zA-Z-]+)\s*$', re.MULTILINE)
print(regx.findall(text))
# result ['75698', 'A14-B32-C7D', 'A14-B32-C74578']
Note the presence of \s* in front of $ in order to catch alphanumeric portions that are separated by whitespace from the end of the line.
Related
I have a string like so, "123.234.567 Remove numbers in this string". The output should be "Remove numbers in this string".
The digits in this string follow the pattern xx.xxx.xxxxx... (digits followed by a period), but the number of periods and the number of digits between periods are not fixed. Here are a couple of examples: xx.xxxxxx.xxxx.xxxxxxxx, x.xx.xxxx.xxxxxxxx.xx.xxxxx, x.xx.xxxxxx, etc.
How can I remove these digits followed by periods in regex?
So far I have something like this:
patt = re.compile('(\s*)[0-9].[0-9]*.[0-9]*(\s*)')
But this only works for a specific format.
Use ^ to match the beginning of the string.
Use \d+ to match any number of digits.
Use \. to match a literal . character
Put \.\d+ in a group with () so you can quantify it to match any number of them.
Use re.sub() to replace it with an empty string to remove the match.
Use a raw string so you can put literal backslashes in the regexp without having to escape them.
patt = re.compile(r'^\s*\d+(?:\.\d+)+\s*')
string = patt.sub('', string)
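A minimal sketch of the steps listed above, run against the sample string from the question. The function name strip_dotted_prefix and the extra sample strings are made up for illustration.

```python
import re

# A dotted-number prefix such as "123.234.567 ", anchored at the start of the string.
DOTTED_PREFIX = re.compile(r'^\s*\d+(?:\.\d+)+\s*')

def strip_dotted_prefix(text):
    """Remove a leading run of period-separated digit groups, if there is one."""
    return DOTTED_PREFIX.sub('', text)

for sample in ('123.234.567 Remove numbers in this string',
               '1.22.3333.44444444.55.66666 another title',
               'No prefix here 1.2'):
    print(repr(strip_dotted_prefix(sample)))
```

Because of the ^ anchor, dotted numbers that appear later in the string are left alone.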
I have the string below and am looking for a way to split it so that I consistently end up with the following output:
'1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']
My current approach
re.split("([0-9][0-9][0-9][A-Z][A-Z])", input) also splits out my delimiter, however, and there is no other split point I could use that stays consistent. Is it possible to split the delimiter itself as well, assigning one part of it ("70") to the preceding string and the other part ("2BE") to the following string?
Use re.findall() instead of re.split().
You want to match
a number \d, followed by
two letters [A-Z]{2}, followed by
a space \s, followed by
a bunch of characters until you encounter a comma [^,]+, followed by
two digits \d{2}
So do:
import re
input_str = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
re.findall(r"\d[A-Z]{2}\s[^,]+,\d{2}", input_str)
Which gives
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']
Alternatively, if you don't want to be so specific with your pattern, you could simply use the regex
[^,]+,\d{2}
This will match one or more characters other than a comma, then a single comma, then two digits.
re.findall(r"[^,]+,\d{2}", input_str)
# Output:
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']
Is it possible to split my delimiter as well and assign a part of it "70" to the string in front and a part "2BE" to the following string?
If you must use re.split at any price, you can exploit a zero-length assertion for this task in the following way:
import re
text = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
parts = re.split(r'(?<=,[0-9][0-9])', text)
print(parts)
output
['1GB 02060250396L7.067,70', '2BE 129517720L6.633,40', '3NL 134187650L3.824,23', '4DE 165893440L3.111,00', '5PL 65775644897L1.010,00', '6DE 811506926L3.547,40', '7AT U16235008L-830,00', '8SE U57469158L3.001,30', '']
Explanation: this is a positive lookbehind; it matches the zero-length position that is preceded by a comma and two digits. Splitting on an empty match requires Python 3.7 or later. Note that parts has a superfluous empty str at the end.
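If the trailing empty string is unwanted, it can be filtered out after the split. The sketch below uses a shortened version of the input for readability and assumes Python 3.7 or later, where re.split accepts patterns that match the empty string.

```python
import re

# Shortened sample: the first three records of the original input.
text = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,23'

# Split at every position preceded by a comma and two digits,
# then drop the empty string produced by the split point at the very end.
parts = [part for part in re.split(r'(?<=,[0-9][0-9])', text) if part]
print(parts)
```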
I have a list of IDs, and I need to check whether these IDs are properly formatted. The correct format is as follows:
[O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
The string can also be followed by a dash and a number. I have two problems with my code: 1) how do I limit the length of the string to exactly the number of characters specified by the search terms? and 2) how can I specify that there can be a "-[0-9]" following the string if it matches?
potential_uniprots=['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
import re
def is_uniprot(ID):
    status=False
    uniprot1=re.compile(r'\b[O,P,Q]{1}[0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot2=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot3=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
        status=True
    return status
correctIDs=[]
for prot in potential_uniprots:
    if is_uniprot(prot) == True:
        correctIDs.append(prot)
print(correctIDs)
Expression Fixes:
BEFORE READING:
All credit for the expression fixes goes to The fourth bird's comment under the original post:
You can omit {1} and the commas from the character class (if you don't want to match commas). The patterns by themselves do not contain a quantifier and have word boundaries, so between these word boundaries you are already matching an exact number of characters. To match an optional hyphen and digit, you can use an optional non-capturing group (?:-[0-9])?
You don't need the , separating the characters in the square brackets as the brackets dictate that the regex should match all characters in the square brackets. For example, a regex such as [A-Z,0-9] is going to match an uppercase character, comma, or a digit whereas a regex such as [A-Z0-9] is going to match an uppercase character or a digit. Furthermore, you don't need the {1} as the regex will match one by default if no quantifiers are specified. This means that you can just delete the {1} from the expression.
Checking Length?
There is a simple way to do this without regex, which is as follows:
string = "Q08F88"
status = len(string) in (6, 10)
But you can also force the regex to match certain lengths by using \b (word boundary), which you have already done. You can alternatively use ^ and $ at the beginning and end of the expression, respectively, to denote the beginning and end of the string.
Consider this expression: ^abcd$ (only match strings that contain abcd and nothing else)
This means that it is only going to match the string:
abcd
And not:
eabcd
abcde
This is because ^ denotes the start of the string and $ denotes the end of the string.
In the end, you're left with this first expression:
(^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](?:-[0-9])?$)
You can modify your other expressions easily as they follow the same structure as above.
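Putting the pieces together, one possible version of the checker is sketched below. It applies the same fixes to all three formats from the question and uses fullmatch in place of the ^ and $ anchors. Folding the formats into a single alternation is a choice made here for brevity, not something the question or the comment prescribes.

```python
import re

# The three accession formats from the question, each optionally followed
# by a dash and a single digit. fullmatch() requires the whole string to match.
UNIPROT = re.compile(
    r'(?:[OPQ][0-9][A-Z0-9]{3}[0-9]'
    r'|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]'
    r'|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9][A-Z][A-Z0-9]{2}[0-9])'
    r'(?:-[0-9])?'
)

def is_uniprot(identifier):
    """Return True if identifier is a well-formed ID in one of the three formats."""
    return UNIPROT.fullmatch(identifier) is not None

potential_uniprots = ['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90',
                      'A7TD4U7ZN11', 'C3KQY5-V']
correct_ids = [prot for prot in potential_uniprots if is_uniprot(prot)]
print(correct_ids)
```

With fullmatch, an ID such as 'C3KQY5-V' is rejected because the suffix after the dash is not a digit.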
Code Suggestions
Your code looks great, but you could make a few minor fixes to improve readability and conventions. For example, you could change this:
if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
    status=True
return status
To this:
return bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
# -OR-
status = bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
return status
Because search() returns either a match object (truthy) or None (falsy), the combined expression already carries the answer; wrapping it in bool() turns it into a plain True or False.
I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
^(?:(?!\d)[\w ])+$
Note: This regex uses the m (multiline) and u (Unicode) flags. Multiline is only necessary if the input lines are separated by newline characters.
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)
$ Assert position at the end of the line
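For reference, the same expression can be tried in Python. This is a sketch under one assumption worth stating: in Python 3, str patterns are Unicode-aware by default, so only the multiline flag has to be passed explicitly.

```python
import re

# Tempered token: every character must be a word character or a space, and not a digit.
NO_DIGITS = re.compile(r'^(?:(?!\d)[\w ])+$', re.MULTILINE)

text = 'ÀÇÆ some words\nÀÇÆ some words 123'
matches = NO_DIGITS.findall(text)
print(matches)
```

Only the first line is returned, since the second one contains digits.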
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']
I'm attempting to match 5-digit coupon codes spread throughout an HTML web page, for example 53232, 21032, 40021, etc. I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches inside 6, 7, 8, ... n digit numbers. Can someone please suggest how I would modify this regular expression to match only 5-digit numbers?
>>> import re
>>> s="four digits 1234 five digits 56789 six digits 012345"
>>> re.findall(r"\D(\d{5})\D", s)
['56789']
If they can occur at the very beginning or the very end, it's easier to pad the string than to mess with special cases:
>>> re.findall(r"\D(\d{5})\D", " "+s+" ")
Instead of padding the string to handle the start and end of the string, as in John La Rooy's answer, one can use a negative lookbehind and a negative lookahead to handle both cases with a single regular expression:
>>> import re
>>> s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
>>> re.findall(r"(?<!\d)\d{5}(?!\d)", s)
['88888', '12345', '98765']
full string: ^[0-9]{5}$
within a string: [^0-9][0-9]{5}[^0-9]
Note: \D is problematic here because it has to consume an actual non-digit character on each side of the number; use \b instead.
\b matches the empty string at a word boundary, that is, at the beginning or end of a word, without consuming any character.
import re
text = "four digits 1234 five digits 56789 six digits 01234,56789,01234"
re.findall(r"\b\d{5}\b", text)
result : ['56789', '01234', '56789', '01234']
but if one uses
re.findall(r"\D(\d{5})\D", text)
output : ['56789', '01234']
Because \D consumes the separator, it misses numbers that share a single separator (such as a comma) with the previous match, as well as numbers at the very start or end of the string.
More documentation: https://docs.python.org/2/library/re.html
Cheers
A very simple way would be to match all groups of digits, like with r'\d+', and then skip every match that isn't five characters long when you process the results.
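A minimal sketch of that filter-afterwards approach; the function name five_digit_codes is made up for illustration.

```python
import re

def five_digit_codes(text):
    """Collect every maximal run of digits and keep only the runs of length five."""
    return [run for run in re.findall(r'\d+', text) if len(run) == 5]

s = "four digits 1234 five digits 56789 six digits 012345"
print(five_digit_codes(s))
```

Since \d+ is greedy, each match is a whole digit run, so longer numbers are discarded rather than truncated.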
You probably want to match a non-digit before and after your string of 5 digits, like [^0-9]([0-9]{5})[^0-9]. Then you can capture the inner group (the actual string you want).
You could try
\D\d{5}\D
or maybe
\b\d{5}\b
I'm not sure how python treats line-endings and whitespace there though.
I believe ^\d{5}$ would not work for you, as you likely want to get numbers that are somewhere within other text.
I use a simpler expression:
re.findall(r"\d{5}", mystring)
It finds runs of 5 digits. But you have to be sure the string contains no longer runs of digits, because it also matches the first 5 digits of any longer number.