identify new line in regex - python

I would like to perform some regex on the text from MAcbeth
My text is as follows:
Scena Secunda.
Alarum within. Enter King Malcome, Donalbaine, Lenox, with
attendants,
meeting a bleeding Captaine.
King. What bloody man is that? he can report,
As seemeth by his plight, of the Reuolt
The newest state
My intention is to get the text from Enter to the full-stop.
I am trying this regular expression Enter(.?)*\.
But it is showing no matches. Can anybody fix my regexp?
I am trying it out in this link

Since #Tushar has not explained the issue you had with your regex, I decided to explain it.
Your regex - Enter(.?)*\. - matches a word Enter (literally), then optionally matches any character except a newline 0 or more times, as many as possible, up to the last period.
The problem is that your string contains a newline between the Enter and the period. You'd need a regex pattern to match newlines, too. To force . to match newline symbols, you may use DOTALL mode. However, it won't get you the expected result as the * quantifier is greedy (will return the longest possible substring).
So, to get the substring from Enter till the closest period, you can use
Enter([^.]*)
See this regex demo. If you need no capture group, remove it.
And an IDEONE demo:
import re
p = re.compile(r'Enter([^.]*)')
test_str = "Scena Secunda.\n\nAlarum within. Enter King Malcome, Donalbaine, Lenox, with\nattendants,\nmeeting a bleeding Captaine.\n\n King. What bloody man is that? he can report,\nAs seemeth by his plight, of the Reuolt\nThe newest state"
print(p.findall(test_str)) # if you need the capture group text, or
# print(p.search(test_str).group()) # to get the whole first match, or
# print(re.findall(r'Enter[^.]*', test_str)) # to return all substrings from Enter till the next period

Related

Python regex to identify two consecutive capitalized words at the beginning of the line

I have this piece of text from which I want to remove both occurrences of each of the names, "Remggrehte Sertrro" and "Perrhhfson Forrtdd". I tried applying this regex: ([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+) but it identifies "Remggrehte Sertrro We", "Perrhhfson Forrtdd If" and also "Mash Mush" which is inside the text.
Basically I want it to only identify first two capitalized words at the beginning of the line without touching the rest. I am no regex expert and I am not sure how to adapt it.
This is the text:
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
Thanks in advance.
You can use this pattern /^([A-Z]+.*? ){2}/m if you are always certain that you are getting only two terms with capitalised first letters and always in the first two terms inline. Example working on regex101.com
You don't need the positive lookahead to match the first 2 capitalized words.
In your pattern, this part (?=\s[A-Z]) can be omitted as your first assert it and then directly match it.
You could match the first 2 words without a capturing group and assert a whitespace boundary (?!\S) at the right
^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)
Explanation
^ Start of string
[A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
[^\S\r\n] Match a whitespace char except a newline as \s could also match a newline and you want to match two consecutive capitalized words at the beginning of the line
[A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
(?!\S) Assert a whitespace boundary at the right
Regex demo
Note that [A-Z][a-z]+ matches only chars a-z. To match word characters you could use \w instead of [a-z] only.
You can remove the line which only contains the names using re.MULTILINE flag and the following regex: r"^(?:[A-Z]\w+\s+[A-Z]\w+\s+)$". This regex will match each name only if it fits in the line without extra text.
Here is a demo:
import re
text = """\
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""
print(re.sub(r"^(?:[A-Z]\w+\s+[A-Z]\w+\s+)$", "", text, flags=re.MULTILINE))
You get:
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.

Python Regex: Match paragraph numbers

I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current code like so:
([0-9a-zA-Z]{1,2}\.)
Only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.
You should use following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Try this demo
Here is the explanation,
\b - Matches a word boundary hence avoiding matching partially in a large word like examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - This matches an alphanumeric text with length one or two as you tried to match in your own regex.
[0-9a-zA-Z] - Finally the match ends with one alphanumeric character at the end. In case you want it to match one or two alphanumeric characters at the end too, just add {1,2} after it
\b - Matches a word boundary again to ensure it doesn't match partially in a large word.
EDIT:
As someone pointed out, in case your text has strings like A.A.A.A.A.A. or A.A.A or even 1.2 and you don't want to match these strings and only want to match strings that has exactly three dots within it, you should use following regex which is more specific in matching your paragraph numbers.
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This new regex matches only paragraph numbers having exactly three dots and those negative look ahead/behind ensures it doesn't match partially in large string like A.A.A.A.A.A
Updated regex demo
Check these python sample codes,
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output,
['C.2.1a.5']
Also for trying to use ^ and $, they are called start and end anchors respectively, and if you use them in your regex, then they will expect matching start of line and end of line which is not what you really intend to do hence you shouldn't be using them and like you already saw, using them won't work in this case.
If simple version is required, you can use this easy to understand and modify regex ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})
I think we should keep the regex expression simple and readable.
You can use the regex
**(?:[a-zA-Z]+\.){3}[a-zA-Z]+**
Explanation -
The expression (?:[a-zA-Z]+.){3} ensures that the group (?:[a-zA-Z]+.) is to be repeated 3 times within the word. The group contains an alphabetic character followed a dot.
The word would end with an alphabetic character.
Output:
['C.2.1a.5']

Python 2.7 Regex capture groups not working as predicted

I am trying to pattern match and replace first person with second person with Python 2.7.
string = re.sub(r'(\W)I(\W)', '\g<1>you\g<2>',string)
string = re.sub(r'(\W)(me)(\W)', '\g<1>you\g<3>',string)
# but does NOT work
string = re.sub(r'(\W)I|(me)(\W)', '\g<1>you\g<3>',string)
I want to use the last regex, but somehow the capture groups are all messed up and even doing a \g<0> shows strange, irregular matches. I would think that capture group 3 would be the last word boundary, but it doesn't appear to be.
A sample sentence could be: I like candy.
I am not interested very much in the correctness of the replacement (me will never actually be selected since I goes first), but I don't know why the capture groups don't work as I would expect.
Thanks!
Try with following regex.
Regex: \b(I|me)\b
Explanation:
\b on both sides marks the word boundary.
(I|me) matches either I OR me.
Note:- You can make it case insensitive using i flag.
Regex101 Demo

Python, regular expression matching digits, x,xxx,xxx but not xx,xx,x,

first time posting, I've lurked for a little while, really excited about the helpful community here.
So, working with "Automate the boring stuff" by Al Sweigart
Doing an exercise that requires I build a regex that finds numbers in standard number format. Three digit, comma, three digits, comma, etc...
So hopefully will match 1,234 and 23,322 and 1,234,567 and 12 but not 1,23,1 or ,,1111, or anything else silly.
I have the following.
import re
testStr = '1,234,343'
matches = []
numComma = re.compile(r'^(\d{1,3})*(,\d{3})*$')
for group in numComma.findall(str(testStr)):
Num = group
print(str(Num) + '-') #Printing here to test each loop
matches.append(str(Num[0]))
#if len(matches) > 0:
# print(''.join(matches))
Which outputs this....
('1', ',343')-
I'm not sure why the middle ",234" is being skipped over. Something wrong with the regex, I'm sure. Just can't seem to wrap my head around this one.
Any help or explanation would be appreciated.
FOLLOW UP EDIT. So after following all your advice that I could assimilate, I got it to work perfectly for several inputs.
import re
testStr = '1,234,343'
numComma = re.compile(r'^(?:\d{1,3})(?:,\d{3})*$')
Num = numComma.findall(testStr)
print(Num)
gives me....
['1,234,343']
Great! BUT! What about when I change the string input to something like
'1,234,343 and 12,345'
Same code returns....
[]
Grrr... lol, this is fun, I must admit.
So the purpose of the exercise is to be able to eventually scan a block of text and pick out all the numbers in this format. Any insight? I thought this would add an additional tuple, not return an empty one...
FOLLOW UP EDIT:
So, a day later(Been busy with 3 daughters and Honey-do lists), I've finally been able to sit down and examine all the help I've received. Here's what I've come up with, and it appears to work flawlessly. Included comments for my own personal understanding. Thanks again for everything, Blckknght, Saleem, mhawke, and BHustus.
My final code:
import re
testStr = '12,454 So hopefully will match 1,234 and 23,322 and 1,234,567 and 12 but not 1,23,1 or ,,1111, or anything else silly.'
numComma = re.compile(r'''
(?:(?<=^)|(?<=\s)) # Looks behind the Match for start of line and whitespace
((?:\d{1,3}) # Matches on groups of 1-3 numbers.
(?:,\d{3})*) # Matches on groups of 3 numbers preceded by a comma
(?=\s|$)''', re.VERBOSE) # Looks ahead of match for end of line and whitespace
Num = numComma.findall(testStr)
print(Num)
Which returns:
['12,454', '1,234', '23,322', '1,234,567', '12']
Thanks again! I have had such a positive first posting experience here, amazing. =)
The issue is due to the fact you're using a repeated capturing group, (,\d{3})* in your pattern. Python's regex engine will match that against both the thousands and ones groups of your number, but only the last repetition will be captured.
I suspect you want to use non-capturing groups instead. Add ?: to the start of each set of parentheses (I'd also recommend, on general principle, to use a raw string, though you don't have escaping issues in your current pattern):
numComma = re.compile(r'^(?:\d{1,3})(?:,\d{3})*$')
Since there are no groups being captured, re.findall will return the whole matched text, which I think is what you wanted. You can also use re.find or re.search and call the group() method on the returned match object to get the whole matched text.
The problem is:
A regex match will return a tuple item for each group. However, it is important to distinguish a group from a capture. Since you only have two parenthese-delimited groups, the matches will always be tuples of two: the first group, and the second. But the second group matches twice.
1: first group, captured
,234: second group, captured
,343: also second group, which means it overwrites ,234.
Unfortunately, it seems that vanilla Python does not have a way to access any captures of a group other than the last one in a manner similar to .NET's regex implementation. However, if you are only interested in getting the specific number, your best bet would be to use re.search(number). If it returns a non-None value, then the input string is a valid number. Otherwise, it is not.
Additionally: A test on your regex. Note that, as Paul Hankin stated, test cases 6 and 7 match even though they shouldn't, due to the first * following the first capturing group, which will make the initial group match any number of times. Otherwise, your regex is correct. Fixed version.
RESPONSE TO EDIT:
The reason now that your regex returns an empty set on ' and ' is because of the ^ and $ anchors in your regex. The ^ anchor, at the start of the regex, says 'this point needs to be at the start of a string'. The $ is its counterpart, saying 'This needs to be at the end of the string'. This is good if you want your entire string from start to end to match the pattern, but if you want to pick out multiple numbers, you should do away with them.
HOWEVER!
If you leave the regex in its current form sans anchors, it will now match the individual elements of 1,23,45 as separate numbers. So for this we need to add a zero-width positive lookahead assertion and say, 'make sure that after this number is either whitespace or the end of a line'. You can see the change here. The tail end, (?=\s|$), is our lookahead assertion: it doesn't capture anything, but just makes sure criteria or met, in this case whitespace (\s) or (|) the end of a line ($).
BUT: In a similar vein, the previous regex would have matched 2 onward in "1234,567", giving us the number "234,567", which would be bad. So we use a lookbehind assertion similar to our lookahead at the end: (?<!^|\s), only match if at the beginning of the string or there is whitespace before the number. This version can be found here, and should soundly satisfy any non-decimal number related needs.
Try:
import re
p = re.compile(ur'(?:(?<=^)|(?<=\s))((?:\d{1,3})(?:,\d{3})*)(?=\s|$)', re.DOTALL)
test_str = """1,234 and 23,322 and 1,234,567 1,234,567,891 200 and 12 but
not 1,23,1 or ,,1111, or anything else silly"""
for m in re.findall(p, test_str):
print m
and it's output will be
1,234
23,322
1,234,567
1,234,567,891
200
12
You can see demo here
This regex, would match any valid number, and would never match an invalid number:
(?<=^|\s)(?:(?:0|[1-9][0-9]{0,2}(?:,[0-9]{3})*))(?=\s|$)
https://regex101.com/r/dA4yB1/1

NOT operator for regex

Using python script, I am cleaning a piece of text where I want to replace following words:
promocode, promo, code, coupon, coupon code, code.
However, I dont want to replace them if they start with a '#'. Thus, #promocode, #promo, #code, #coupon should remain the way they are.
I tried following regex for it:
1. \b(promocode|promo code|promo|coupon code|code|coupon)\b
2. (?<!#)(promocode|promo code|promo|coupon code|code|coupon)
None of them are working. I am basically looking something that will allow me to say "Does NOT start with # and" (promocode|promo code|promo|coupon code|code|coupon)
Any suggestions ?
You need to use a negative look-behind:
(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b
This (?<!#) will ensure you will only match these words if there is no # before them and \b will ensure you only match whole words. The non-capturing group (?:...) is used just for grouping purposes so as not to repeat \b around each alternative in the list (e.g. \bpromo\b|\bcode\b...). Why use non-capturing group? So that it does not interfere with the Match result. We do not need unnecessary overhead with digging out the values (=groups) we need.
See demo here
See IDEONE demo, only the first promo is deleted:
import re
p = re.compile(r'(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b')
test_str = "promo #promo "
print(p.sub('', test_str))
A couple of words about your regular expressions.
The \b(promocode|promo code|promo|coupon code|code|coupon)\b is good, but it also matches the words in the alternation group not preceded with #.
The (?<!#)(promocode|promo code|promo|coupon code|code|coupon) regex is better, but you still do not match whole words (see this demo).

Categories

Resources