Why do these two expressions return the same output?
phillip = '#awesome '
nltk.re_show('\w+|[^\w\s]+', phillip)
vs.
nltk.re_show('\w+|[^\w]+', phillip)
Both return:
{#}{awesome}
Why doesn't the second one return
{#}{awesome}{ }?
It appears this that nltk right-strips whitespace in strings before applying the regex.
See the source code (or you could import inspect and print inspect.get_source(nltk.re_show))
def re_show(regexp, string, left="{", right="}"):
"""docstring here -- I stripped it for brevity"""
print(re.compile(regexp, re.M).sub(left + r"\g<0>" + right, string.rstrip()))
In particular, see the string.rstrip(), which strips all trailing whitespace.
For example, if you make sure that your phillip string does not have a space to the right:
nltk.re_show('\w+|[^\w]+', phillip + '.')
# {#}{awesome}{ .}
Not sure why nltk would do this, it seems like a bug to me...
\w looks to match [A-Za-z0-9_]. And since you are looking for one OR the other (1+ "word" characters OR 1+ non-"word" characters), it matches the first character as a \w character and keeps going until it hits a non-match .
If you do a global match, you will see that there is another match containing the space (the first non-"word" character).
Related
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <$])", "+1.222.222.2222<")
The above code works fine if my string ends with a "<" or space. But if it's the end of the string, it doesn't work. How do I get +1.222.222.2222 to return in this condition:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <$])", "+1.222.222.2222")
*I removed the "<" and just terminated the string. It returns none in this case. But I'd like it to return the full string -- +1.222.222.2222
POSSIBLE ANSWER:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <]|$)", "+1.222.222.2222")
I think you've solved the end-of-string issue, but there are a couple of other potential issues with the pattern in your question:
the - in [ -.] either needs to be escaped as \- or placed in the first or last position within square brackets, e.g. [-. ] or [ .-]; if you search for [] in the docs here you'll find the relevant info:
Ranges of characters can be indicated by giving two characters and separating them
by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match
all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal
digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character
(e.g. [-a] or [a-]), it will match a literal '-'.
you may want to require that either matching parentheses or none are present around the first 3 of 10 digits using (?:\(\d{3}\) ?|\d{3}[-. ]?)
Here's a possible tweak incorporating the above
import re
pat = "^((?:\+1[-. ]?|1[-. ]?)?(?:\(\d{3}\) ?|\d{3}[-. ]?)\d{3}[-. ]?\d{4})(?:[ <]|$)"
print( re.findall(pat, "+1.222.222.2222") )
print( re.findall(pat, "+1(222)222.2222") )
print( re.findall(pat, "+1(222.222.2222") )
Output:
['+1.222.222.2222']
['+1(222)222.2222']
[]
Maybe try:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:| |<|$)", "+1.222.222.2222")
null matches any position, +1.222.222.2222
matches space character, +1.222.222.2222
< matches less-than sign character, +1.222.222.2222<
$ end of line, +1.222.222.2222
You can also use regex101 for easier debugging.
I want to extract the number before "2022" in a set of strings possibly. I current do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails when mystring=' 20220220519AX' to return a='202'. Therefore, I want the code to split the string on "2022" that is not at the beginning non-whitespace characters in the string.
Can you please guide with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2020 it can find and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
You could split() asserting not the start of the string to the left after using strip(), or you can get the first occurrence of 1 or more digits from the start of the string, in case there are more occurrences of 2022
import re
strings = [
' 1020220519AX',
' 20220220519AX'
]
for s in strings:
parts = re.split(r"(?<!^)2022", s.strip())
if parts:
print(parts[0])
for s in strings:
m = re.match(r"\s*(\d+?)2022", s)
if m:
print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits, it is only splitted.
If the string consists of only word characters, splitting on \B2022 where \B means non a word boundary, will also prevent splitting at the start of the example string.
I want to split '10.1 This is a sentence. Another sentence.'
as ['10.1 This is a sentence', 'Another sentence'] and split '10.1. This is a sentence. Another sentence.' as ['10.1. This is a sentence', 'Another sentence']
I have tried
s.split(r'\D.\D')
It doesn't work, how can this be solved?
If you plan to split a string on a . char that is not preceded or followed with a digit, and that is not at the end of the string a splitting approach might work for you:
re.split(r'(?<!\d)\.(?!\d|$)', text)
See the regex demo.
If your strings can contain more special cases, you could use a more customizable extracting approach:
re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)
See this regex demo. Details:
(?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
\d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of . and one or more digits ((?:\.\d+)*) and then an optional . char (\.?)
| - or
[^.] - any char other than a . char.
All sentences (except the very last one) end with a period followed by space, so split on that. Worrying about the clause number is backwards. You could potentially find all kinds of situations that you DON'T want, but it is generally much easier to describe the situation that you DO want. In this case '. ' is that situation.
import re
doc = '10.1 This is a sentence. Another sentence.'
def sentences(doc):
#split all sentences
s = re.split(r'\.\s+', doc)
#remove empty index or remove period from absolute last index, if present
if s[-1] == '':
s = s[0:-1]
elif s[-1].endswith('.'):
s[-1] = s[-1][:-1]
#return sentences
return s
print(sentences(doc))
The way I structured my regex it should also eliminate arbitrary whitespace between paragraphs.
You have multiple issues:
You're not using re.split(), you're using str.split().
You haven't escaped the ., use \. instead.
You're not using lookahead and lookbehinds so your 3 characters are gone.
Fixed code:
>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']
Basically, (?<=\D\.) finds a position right after a . that has a non-digit character. (?=\D) then makes sure there's a non digit after the current position. When everything applies, it splits correctly.
I have a list of IDs, and I need to check whether these IDs are properly formatted. The correct format is as follows:
[O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
The string can also be followed by a dash and a number. I have two problems with my code: 1) how do I limit the length of the string to exactly the number of characters specified by the search terms? and 2) how can I specify that there can be a "-[0-9]" following the string if it matches?
potential_uniprots=['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
import re
def is_uniprot(ID):
status=False
uniprot1=re.compile(r'\b[O,P,Q]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
uniprot2=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
uniprot3=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
status=True
return status
correctIDs=[]
for prot in potential_uniprots:
if is_uniprot(prot) == True:
correctIDs.append(prot)
print(correctIDs)
Expression Fixes:
BEFORE READING:
All credit for the expression fixes goes to The fourth bird's comment. Please see that comment here or under the original post:
You can omit {1} and the comma's from the character class (If you don't want to match comma's) The patterns by them selves do not contain a quantifier and have word boundaries. So between these word boundaries, you are already matching an exact amount of characters. To match an optional hyphen and digit, you can use an optional non capturing group (?:-[0-9])?
You don't need the , separating the characters in the square brackets as the brackets dictate that the regex should match all characters in the square brackets. For example, a regex such as [A-Z,0-9] is going to match an uppercase character, comma, or a digit whereas a regex such as [A-Z0-9] is going to match an uppercase character or a digit. Furthermore, you don't need the {1} as the regex will match one by default if no quantifiers are specified. This means that you can just delete the {1} from the expression.
Checking Length?
There is a simple way to do this without regex, which is as follows:
string = "Q08F88"
status = (len(string) == 6 or len(string) == 8)
But you can also force the regex to match certain lengths use \b (word-boundary), which you have already done. You can alternatively use ^ and $ at the beginning and end of the expression, respectively, to denote the beginning and end of the string.
Consider this expression: ^abcd$ (only match strings that contain abcd and nothing else)
This means that it is only going to match the string:
abcd
And not:
eabcd
abcde
This is because ^ denotes the start of the string and $ denotes the end of the string.
In the end, you're left with this first expression:
(^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](?:-[0-9])?$)
You can modify your other expressions easily as they follow the same structure as above.
Code Suggestions
Your code looks great, but you could make a few minor fixes to improve readability and conventions. For example, you could change this:
if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
status=True
return status
To this:
return (uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID))
# -OR-
stats = (uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID))
return status
Because uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID) is never going to return anything other than True or False, so it is safe to return that expression.
I want to get the content between single quotes, but only if it contains a certain word (i.e 'sample_2'). It additionally should not match ones with white space.
Input example: (The following should match and return only: ../sample_2/file and sample_2/file)
['asdf', '../sample_2/file', 'sample_2/file', 'example with space', sample_2, sample]
Right now I just have that matched the first 3 items in the list:
'(.\S*?)'
I can't seem to find the right regex that would return those containing the word 'sample_2'
If you want specific words/characters you need to have them in the regular expression and not use the '\S'. The \S is the equivalent to [^\r\n\t\f\v ] or "any non-whitespace character".
import re
teststr = "['asdf', '../sample_2/file', 'sample_2/file', 'sample_2 with spaces','example with space', sample_2, sample]"
matches = re.findall(r"'([^\s']*sample_2[^\s]*?)',", teststr)
# ['../sample_2/file', 'sample_2/file']
Based on your wording, you suggest the desired word can change. In that case, I would recommend using re.compile() to dynamically create a string which then defines the regular expression.
import re
word = 'sample_2'
teststr = "['asdf', '../sample_2/file', 'sample_2/file', ' sample_2 with spaces','example with space', sample_2, sample]"
regex = re.compile("'([^'\\s]*"+word+"[^\\s]*?)',")
matches = regex.findall(teststr)
# ['../sample_2/file', 'sample_2/file']
Also if you haven't heard of this tool yet, check out regex101.com. I always build my regular expressions here to make sure I get them correct. It gives you the references, explanation of what is happening and even lets you test it right there in the browser.
Explanation of regex
regex = r"'([^\s']*sample_2[^\s]*?)',"
Find first apostrophe, start group capture. Capture anything except a whitespace character or the corresponding ending apostrophe. It must see the letters "sample_2" before accepting any non-whitespace character. Stop group capture when you see the closing apostrophe and a comma.
Note: In python, a string " or ' prepositioned with the character 'r' means the text is compiled as a regular expression. Strings with the character 'r' also do not require double-escape '\' characters.