Python using re to match string in a specific pattern - python

I am trying to use python re to match a string with a specific pattern.
The problem is that I have this expected sentence:
"It is X. not X"
X can be anything: a word, a bunch of words, a number, or digits.
The pattern I built is:
It is \w+. not \w+
just using
string.replace("X", "\w+")
It works if X is a word, a bunch of words, or an int, but not for digits. How can I build my pattern so that it matches everything of this form?

The . is a special character in a regular expression that will match any character. So .+ will match one or more characters.
r"It is .+\. not .+"
Note that the period is escaped as \. because in that position you want to match an actual period.
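As a minimal sketch of how the template approach from the question could be combined with this answer: escape the literal parts of the template with re.escape, then swap each X for a capturing (.+). The helper name template_to_regex and its placeholder argument are hypothetical, and the sketch assumes the placeholder letter never occurs in the literal text of the template.

```python
import re

def template_to_regex(template, placeholder="X"):
    # Hypothetical helper: escape the literal text (so "." means a real period),
    # then replace each placeholder with a capturing "one or more characters".
    escaped = re.escape(template)
    return re.compile(escaped.replace(placeholder, "(.+)"))

pattern = template_to_regex("It is X. not X")

for sentence in ["It is a dog. not a cat", "It is 3.14. not 2.71", "It was fine"]:
    match = pattern.fullmatch(sentence)
    print(sentence, "->", match.groups() if match else None)
```

Using fullmatch keeps the check strict: the whole sentence has to follow the template, not just a part of it.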

Because .+ won't work in some cases, for example
It is quote. not a double-quote
It is a dog. not a cat
I would use this one instead :
(?<=It is ).+(?=\.)|(?<=not ).+$
Explanation
(?<=It is ).+(?=\.) Any consecutive characters preceded by It is and followed by a period
| OR
(?<=not ).*$ Any consecutive characters preceded by not and followed by the end-of-line anchor
(?<=It is ).*(?=\.)|(?<=not ).*$

I have figured it out: str.replace("X", "(\w+|\d+\.\d+)") solves the problem. I hope this helps others with the same issue.

Related

How to handle " in Regex Python

I am trying to grab fary_trigger_post in the code below using regex. However, I don't understand why the match always includes a " at the end, which I don't expect.
Any idea or suggestion?
re.match(
r'-instance[ "\']*(.+)[ "\']*$',
'-instance "fary_trigger_post" '.strip(),
flags=re.S).group(1)
'fary_trigger_post"'
Thank you.
The (.+) is greedy and grabs ANY character until the end of the input. If you modified your input to include characters after the final double quote (e.g. '-instance "fary_trigger_post" asdf') you would find the double quote and the remaining characters in the output (e.g. fary_trigger_post" asdf). Instead of .+ you should try [^"\']+ to capture all characters except the quotes. This should return what you expect.
re.match(r'-instance[ "\']*([^"\']+)[ "\'].*$', '-instance "fary_trigger_post" '.strip(), flags=re.S).group(1)
Also, note that I modified the end of the expression to use .* which will match any characters following the last quote.
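To see the difference side by side, here is a small sketch that runs the question's pattern and the negated-class pattern on the same input. The variable names greedy and negated are only for illustration.

```python
import re

text = '-instance "fary_trigger_post" '.strip()

# Greedy (.+) runs to the end of the input; the trailing class [ "']* is
# optional, so it matches nothing and the closing quote stays in group 1.
greedy = re.match(r'-instance[ "\']*(.+)[ "\']*$', text, flags=re.S).group(1)

# A negated class cannot cross a quote, so the capture stops before it.
negated = re.match(r'-instance[ "\']*([^"\']+)[ "\'].*$', text, flags=re.S).group(1)

print(greedy)   # fary_trigger_post"
print(negated)  # fary_trigger_post
```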
Here's what I'd use in your matching string, but it's hard to provide a better answer without knowing all your cases:
r'-instance\s+"(.+)"\s*$'
When you try to get group 1 (i.e. (.+)), the regex engine follows this match to the end of the string, because . (any character) can match one or more times and the greedy quantifier takes as many characters as possible. I would suggest using the following pattern:
'-instance[ "\']*(.+)["\']+ *$'
This requires the regex to match the trailing quotes and the trailing spaces separately, so that they are not included in group 1.

Python Regex: Match paragraph numbers

I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current pattern is:
([0-9a-zA-Z]{1,2}\.)
Only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.
You can use the following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Try this demo
Here is the explanation:
\b - Matches a word boundary, which avoids matching part of a larger word such as examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - Matches one or two alphanumeric characters followed by a literal dot, as in your own regex, repeated one or more times.
[0-9a-zA-Z] - Finally, the match ends with one alphanumeric character. If you want it to end with one or two alphanumeric characters instead, just add {1,2} after it.
\b - Matches a word boundary again to ensure the match does not stop in the middle of a larger word.
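A short sketch comparing the pattern from the question with the word-boundary version on the example sentence. The variable names original and bounded are just labels for this illustration.

```python
import re

sentence = "Refer to paragraph C.2.1a.5 for examples."

original = r"([0-9a-zA-Z]{1,2}\.)"                    # pattern from the question
bounded = r"\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b"   # pattern from this answer

print(re.findall(original, sentence))  # ['C.', '2.', '1a.', 'es.']
print(re.findall(bounded, sentence))   # ['C.2.1a.5']
```

The original pattern returns each dotted piece separately (including the tail of "examples."), while the bounded version returns the paragraph number as one match.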
EDIT:
As someone pointed out, your text may contain strings like A.A.A.A.A.A. or A.A.A or even 1.2. If you don't want to match these and only want strings that have exactly three dots in them, use the following regex, which is more specific in matching your paragraph numbers.
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This new regex matches only paragraph numbers with exactly three dots, and the negative lookbehind and lookahead ensure it doesn't match part of a longer string like A.A.A.A.A.A
Updated regex demo
Here is sample Python code:
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output:
['C.2.1a.5']
As for ^ and $: they are the start and end anchors. If you use them in your regex, the match has to begin at the start of the line and finish at the end of the line, which is not what you intend here. That is why, as you already saw, using them returns no matches in this case.
If a simple version is required, you can use this regex, which is easy to understand and modify: ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})
I think we should keep the regex simple and readable.
You can use the regex
(?:[a-zA-Z0-9]+\.){3}[a-zA-Z0-9]+
Explanation:
The expression (?:[a-zA-Z0-9]+\.){3} ensures that the group (?:[a-zA-Z0-9]+\.) is repeated 3 times within the word. The group contains one or more alphanumeric characters followed by a dot.
The word then ends with one or more alphanumeric characters.
Output:
['C.2.1a.5']

regular expression ending

I have ONE string as plain text and want to extract phone numbers of any format from it.
Here is my regex:
r = re.compile(r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)[-\s*]\d{3}[-\.\s]??\d{4})")
It extracts the following matches correctly:
617.933.6444
(880)-567-4565
(880) 567-4565
222-333-8888
555 666 4444
9999999999
But how can I avoid getting 7986815059 when I have 798681505951 in the text?
How can I anchor the end of my regex? There should be no letters or digits immediately before or after the match, and the number must contain exactly 10 digits.
Solution
If somebody needs to find US phone numbers in a string, use the link from the last Wiktor Stribiżew comment.
You need to use word boundaries, but placing them into your pattern is not obvious. It is due to the fact that the second alternative starts with a non-word char, \(. Thus, the first \b must be added at the beginning of the first alternative, and the trailing one at the very end of the pattern:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
   ^^                                                               ^^
See the regex demo
You may also require a non-word char or start of string before (. Then add \B at the second alternative start:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\B\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
                                   ^^
See another demo
Also, note that there is no need to escape a . inside a character class: it is already parsed as a literal dot in [.]. There is also no need for the lazy ?? quantifier. It makes no difference here, and the greedy version, ?, works equally well and looks "cleaner".
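A quick sketch of the boundary-anchored pattern in Python. The sample text is made up for illustration; the pattern is the \b / \B version shown above.

```python
import re

phone = re.compile(
    r"(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\B\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b"
)

# Made-up sample text: the 12-digit run must not yield a 10-digit match.
text = "Call 617.933.6444 or (880)-567-4565, not 798681505951; 9999999999 is fine."
found = phone.findall(text)
print(found)  # ['617.933.6444', '(880)-567-4565', '9999999999']
```

The 12-digit run is skipped because there is no word boundary after its tenth digit, and no other starting position inside the run sits on a boundary.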

Match a sentence

I wish to chop some text into sentences.
I wish to match all text up until a period followed by a space, a question mark followed by a space, or an exclamation mark followed by a space, in a non-greedy fashion.
Additionally, the punctuation might be found at the very end of the string or followed by a \r\n, for example.
This will almost do it:
([^\.\?\!]*)
But I'm missing the space in the expression. How do I fix this?
Example:
I' a.m not. So? Sure about this! Actually.
Should give:
I' a.m not
So
Sure about this
Actually
You can achieve such conditions by using positive lookahead assertions.
[^.?!]+(?=[.?!] )
See it here on Regexr.
When you look at the demo, the sentences at the end of a row with no following space are not matched. You can fix this by adding an alternation with the anchor $ and using the modifier m (which makes $ match at the end of each row):
[^.?!]+(?=[.?!](?: |$))
See it here on Regexr
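Here is a sketch of that lookahead pattern in Python on the example from the question. With findall, each match keeps its leading space, and because [^.?!]+ cannot cross the period inside "a.m", the first sentence comes back shortened. The split_sentences helper is a hypothetical alternative, not taken from the answers: it splits on the same terminator rule (punctuation followed by whitespace or end of string) and discards empty pieces.

```python
import re

text = "I' a.m not. So? Sure about this! Actually."

# Pattern from the answer; re.M makes $ match at the end of each line.
lookahead = re.compile(r"[^.?!]+(?=[.?!](?: |$))", re.M)
print(lookahead.findall(text))
# ['m not', ' So', ' Sure about this', ' Actually']

def split_sentences(s):
    # Hypothetical helper: split on . ? or ! only when the mark is followed
    # by whitespace or the end of the string, then drop empty pieces.
    return [part for part in re.split(r"[.?!](?:\s+|$)", s) if part]

print(split_sentences(text))
# ["I' a.m not", 'So', 'Sure about this', 'Actually']
```

Since \s+ also covers \r\n, the split version handles punctuation followed by a line break as well.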
Try this:
(.*?[!\.\?] )
.*? matches any characters, as few as possible,
[] matches any one of the listed characters,
then the () gives you a group to reference so you can get the match out.
Use a non-greedy match with a lookahead:
^.*?(?=[.!?]( |$))
Note how you don't have to escape those chars when they are in a character class [...].
This should do it:
^.*?(?=[!.?][\s])

Python regex positive look ahead

I have the following regex that is supposed to find sequences of words that end with a punctuation mark. The lookahead ensures that the match is followed by a space and a capital letter or digit.
pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")
What is the function of the following lookahead?
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")
Does Python 3.2 support variable-length lookahead (\s+)? I do not get any error. Furthermore, I cannot see any difference between the two patterns. Both seem to work the same regardless of the number of blanks that I have. Is there an explanation for the purpose of the \s+ in the lookahead?
I'm not really sure what you are trying to achieve here.
Sequence of words ended by a punctuation can be matched with something like:
re.findall(r'([\w\s]*[\?\!\.;])', s)
Does the lookahead require another string to follow?
In any case:
\s requires exactly one whitespace character;
\s+ requires at least one whitespace character.
And yes, the lookahead accepts the "+" quantifier, even in Python 2.x
The same as before but with a lookahead:
re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)
or
re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)
you can try them all on something like:
s='Stefano ciao. a domani. a presto;'
Depending on your strings, the lookahead may or may not be necessary, and allowing more than one space with "+" may or may not change the result.
The difference is that the first lookahead expects exactly one whitespace character before the digit or capital letter, while the second one expects at least one whitespace character and takes as many as possible.
The + is called a quantifier. It means one or more repetitions, as many as possible.
To recap
\s (Exactly one whitespace character allowed. Will fail without it or with more than one.)
\s+ (At least one but maybe more whitespaces allowed.)
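A small sketch that makes the difference visible. The sample text is made up and has three blanks after the first sentence, so the single-\s lookahead cannot succeed there.

```python
import re

pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")

text = "Hello there.   Next one! 2 left."  # three blanks after "there."

print(pat1.findall(text))  # ['Hello there.   Next one!']
print(pat2.findall(text))  # ['Hello there.', 'Next one!']
```

With pat1 the first period is rejected as an end point, so the lazy .+? keeps going until the "!" that is followed by exactly one space and a digit. With pat2 the three blanks are accepted and the text splits after the first sentence.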
Further studying.
I have multiple blanks, the \w.+? continues to match the blanks until the last blank before the capital letter
To answer this comment please consider :
What does \w.+? actually match?
A single word character [a-zA-Z0-9_] followed by at least one "any" character(except newline) but with the lazy quantifier +?. So in your case, it leaves one space so that the lookahead later matches. Therefore you consume all the blanks except one. This is why you see them at your output.
