Negative lookahead to identify and match a text sequence - python

I want to match text that is not followed by a certain pattern. I think a negative lookahead can be used in this case. I tried "(.*?)(?![A-Z]+:)" but was not able to get the expected result.
Example:
Paragraph 1: "This is a simple text. INTRODUCTION: Intro is the start of a paragraph"
Paragraph 2: "This is a simple text"
Expected output: I don't want the regex to match paragraph 1, but only paragraph 2, which doesn't have the pattern "[A-Z]+:" following the text.
Any help is appreciated.

You can put the [A-Z]: part inside the negative lookahead to assert that it does not occur anywhere to the right.
You can omit the + after [A-Z], because a single uppercase letter directly before the colon is enough to detect the pattern. As you only need the match itself, you can also omit the capturing group.
Use .+ so that an empty string is not matched (note that the dot does match a space).
^(?!.*[A-Z]:).+
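As a rough sketch, the pattern above can be checked against the two paragraphs from the question. The variable names paragraph1 and paragraph2 are only illustrative.

```python
import re

# Pattern from the answer: reject the line if "[A-Z]:" occurs anywhere in it.
pattern = re.compile(r'^(?!.*[A-Z]:).+')

paragraph1 = "This is a simple text. INTRODUCTION: Intro is the start of a paragraph"
paragraph2 = "This is a simple text"

for text in (paragraph1, paragraph2):
    match = pattern.match(text)
    # Prints None for paragraph1 and the full text for paragraph2.
    print(repr(text), '->', match.group() if match else None)
```

Note that the dot does not match a newline by default, so a paragraph that spans several lines would need re.DOTALL for the lookahead to see past the first line.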

Related

How to write a regex to accept a string which ends with a string

I want to write a regex which accepts this:
Accept:
done
done1
done1,done2,done3
Do not accept:
done1,
done1,done2,
I tried to write this regex
([a-zA-Z]+)?(/d)?(,)([a-zA-Z]+)
but it is not working.
What's wrong? How can I fix it?
I would phrase the regex pattern as:
(?<!\S)\w+(?:,\w+)*(?!\S)
Sample script:
import re
inp = "done done1 done1,done2,done3 done1, done1,done2,"
matches = re.findall(r'(?<!\S)\w+(?:,\w+)*(?!\S)', inp)
print(matches) # ['done', 'done1', 'done1,done2,done3']
Here is an explanation of the regex pattern:
(?<!\S) assert that what precedes is either whitespace or the start of the input
\w+ match a word
(?:,\w+)* followed by a comma and another word, zero or more times
(?!\S) assert that what follows the final word is either whitespace or the end of the input
It also depends on how you apply the regex. The regex alone (e.g. when used with re.search()) tells you whether the input contains any substring which matches your regex. In the trivial case, if you are examining one line at a time, add start and end of line anchors around your regex to force it to match the entire line.
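As a minimal sketch of that point, assuming each candidate string is checked on its own, re.fullmatch() behaves as if the pattern were wrapped in start and end anchors. The pattern \w+(?:,\w+)* is borrowed from the answer above, and the candidates list is only illustrative.

```python
import re

# A word, then zero or more ",word" groups.
pattern = re.compile(r'\w+(?:,\w+)*')

candidates = ["done", "done1", "done1,done2,done3", "done1,", "done1,done2,"]

# fullmatch() only succeeds if the whole string matches the pattern.
accepted = [s for s in candidates if pattern.fullmatch(s)]
print(accepted)  # ['done', 'done1', 'done1,done2,done3']

# search() is satisfied by the substring "done1", so it cannot reject "done1,".
print(bool(pattern.search("done1,")))  # True
```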
Also, of course, notice that the regex to match a single digit is \d, not /d.
Your regex looks like you want both the alphabetics and the numbers to be optional, but the group of alphabetics and numbers to be non-empty; is that correct? One way to do that is to add a lookahead (?=[a-zA-Z\d]) before the phrase which matches both optionally.
import re
tests = """\
done
done1
done1,done2,done3
done1,
done1,done2,
"""
regex = re.compile(r'^(?=[a-zA-Z\d])[a-zA-Z]*\d?(?:,(?=[a-zA-Z\d])[a-zA-Z]*\d?)*$')
for line in tests.splitlines():
    match = regex.search(line)
    if match:
        print(line)
The individual phrases here should be easy to understand. [a-zA-Z]* matches zero or more alphabetics, and \d? matches zero or one digits. We require one of those, followed by zero or more repetitions of a comma followed by a repeat of the first expression.
Perhaps also note that [a-zA-Z\d] is almost the same as \w (the latter also matches an underscore). If you don't care about this inexactness, the expression could be simplified. It would certainly be useful in the lookahead, where the regex after it will not match an underscore anyhow. But I've left in the more complex expression just to make the code easier to follow in relation to the original example.
Demo: https://ideone.com/4mVGDh

python regex expression to match (first multipart or simple part) rar archive

I would like to match either
the first element of a multipart rar archive,
with a regex like (.*.)part0*1.rar
or
a single-part rar archive.
Any other string of the form ^.*(part\d+).rar$ should not match.
I use this regex:
regex = r"(.*)(?:part0*1|.*[^(part\d+)])\.rar"
I've got some issues:
apps.rar matches, but apps2.rar doesn't match and should
LA460.6.7.rar doesn't match and should
apps.rar should match with group(1)="apps", not group(1)="app"
You can check the snippet on regex101.
Could you find the error in the regex?
Thanks
The reason that group 1 sometimes loses its last character is that the pattern you tried, (.*)(?:part0*1|.*[^(part\d+)])\.rar, first captures the whole line in capture group 1 and then has to backtrack.
That capture group is followed by an alternation matching either part0*1 or .*[^(part\d+)]
The lines that have part followed by optional zeroes and a 1 at the end are matched by the first alternative.
But when there is no match for part0*1, the next alternative is tried, which is .*[^(part\d+)].
In the second alternative, .* can match the empty string, and then [^(part\d+)] matches exactly one character. The square brackets make it a negated character class without a quantifier, not a group: it matches any single character except (, p, a, r, t, a digit, + or ). That single character is taken from the end of the name, so it is missing from group 1, and a name that ends in a digit, such as apps2.rar, cannot match at all.
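The behaviour described above can be reproduced with a short script. This is only a sketch that runs the pattern from the question unchanged against the file names mentioned there; the variable name original is illustrative.

```python
import re

# The pattern from the question, unchanged.
original = re.compile(r"(.*)(?:part0*1|.*[^(part\d+)])\.rar")

m = original.match("apps.rar")
print(m.group(1))  # app  (the "s" was consumed by the character class)

# The character before ".rar" is a digit, which the negated class excludes.
print(original.match("apps2.rar"))      # None
print(original.match("LA460.6.7.rar"))  # None
```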
One option could be using a negative lookahead asserting that the string does not contain part followed by optional zeroes and then either a digit 2-9 with optional digits, or a digit 1-9 with one or more digits.
^(?!.*part0*(?:[2-9]\d*|[1-9]\d+)\.rar)(.+)\.rar$
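A quick way to check this pattern is to run it over a few names. The movie.partN.rar names below are made-up examples, and the other three come from the question.

```python
import re

fixed = re.compile(r"^(?!.*part0*(?:[2-9]\d*|[1-9]\d+)\.rar)(.+)\.rar$")

names = [
    "apps.rar", "apps2.rar", "LA460.6.7.rar",  # single-part archives
    "movie.part1.rar", "movie.part01.rar",     # first volume of a multipart set
    "movie.part2.rar", "movie.part10.rar",     # later volumes, should be rejected
]

# Map each name to group(1), or None when the name is rejected.
results = {}
for name in names:
    m = fixed.match(name)
    results[name] = m.group(1) if m else None
    print(name, '->', results[name])
```

Note that for the first volume of a multipart set, group(1) still includes the part suffix (for example movie.part1).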
You can search for filenames that either have the word 'part' followed by 01 or 1, or don't have the word 'part' at all.
Please try the regex below:
(.*part0?1|^(?!.*part.*).*)\.rar

Python Regex: Match paragraph numbers

I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current regex looks like this:
([0-9a-zA-Z]{1,2}\.)
It only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.
You can use the following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Here is the explanation:
\b - Matches a word boundary, which avoids matching in the middle of a larger word like examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - Matches one or two alphanumeric characters followed by a literal dot, as in your own regex, repeated one or more times.
[0-9a-zA-Z] - Finally, the match ends with one alphanumeric character. If you want it to end with one or two alphanumeric characters instead, just add {1,2} after it.
\b - Matches a word boundary again to ensure the match doesn't stop in the middle of a larger word.
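For comparison, here is a small sketch that runs both the regex from the question and the one explained above on the sample sentence. The variable names are illustrative.

```python
import re

text = 'Refer to paragraph C.2.1a.5 for examples.'

# Regex from the question: each "one or two alphanumerics plus dot" is a separate match.
attempt = re.findall(r'([0-9a-zA-Z]{1,2}\.)', text)
print(attempt)  # ['C.', '2.', '1a.', 'es.']

# Regex from the answer: word boundaries plus a repeated group give one full match.
fixed = re.findall(r'\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b', text)
print(fixed)  # ['C.2.1a.5']
```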
EDIT:
As someone pointed out, your text may contain strings like A.A.A.A.A.A. or A.A.A or even 1.2. If you don't want to match these and only want to match strings that have exactly three dots, you should use the following regex, which is more specific in matching your paragraph numbers.
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This new regex matches only paragraph numbers having exactly three dots, and the negative lookbehind and lookahead ensure it doesn't match part of a longer string like A.A.A.A.A.A
Check this Python sample code:
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output,
['C.2.1a.5']
As for ^ and $: they are the start and end anchors. If you use them, the regex has to match from the start of the line to the end of the line, which is not what you intend here, because the paragraph number sits in the middle of a sentence. That is why using them returns no matches in this case.
If a simple version is required, you can use this regex, which is easy to understand and modify: ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})
I think we should keep the regex expression simple and readable.
You can use the regex
(?:[0-9a-zA-Z]+\.){3}[0-9a-zA-Z]+
Explanation:
The expression (?:[0-9a-zA-Z]+\.){3} ensures that the group (?:[0-9a-zA-Z]+\.) is repeated 3 times within the word. The group contains one or more alphanumeric characters followed by a dot.
The word then ends with one or more alphanumeric characters.
Output:
['C.2.1a.5']

Regex RE: All but not this pattern

quick question...
I need a regex that matches a particular letter in a code unless it is contained in a certain pattern.
I want something that matches N followed or preceded by anything, as long as it isn't preceded IMMEDIATELY by C(=O).
Example:
C(=O)N
Should not match
C(=O)CN
Should match
But it doesn't need an anchor because:
C(=O)NCCCN
Should match because of the N at the end
So far I have this:
(?!C\(=O\)N$)[N]
Any help would be appreciated.
You can use a negative lookbehind:
(?<!C\(=O\))N
See the regex demo
The N will get matched only when it is not immediately preceded by a literal C(=O) sequence.
The (?<!...) construct is called a negative lookbehind. It does not consume characters (does not move the regex index), but just checks that something is absent from the string immediately before the current position. If the lookbehind pattern does match there, the overall match fails at that position. See Lookarounds for more details.
In Python: r'(?<!C\(=O\))N':
import re
p = re.compile(r'(?<!C\(=O\))N')
strs = ["C(=O)N", "C(=O)CN", "C(=O)NCCCN"]
print([x for x in strs if p.search(x)])  # ['C(=O)CN', 'C(=O)NCCCN']
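To see that the lookbehind itself consumes nothing, it can help to look at the span of the match. The following is a small sketch using the third example string from the question.

```python
import re

p = re.compile(r'(?<!C\(=O\))N')

# The N at index 5 is rejected because "C(=O)" sits right before it;
# the N at index 9 is accepted, and the match covers only that one character.
m = p.search("C(=O)NCCCN")
print(m.group(), m.span())  # N (9, 10)

# Here the only N is immediately preceded by "C(=O)", so there is no match.
print(p.search("C(=O)N"))  # None
```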
Use a negative look-behind instead:
(?<!C\(=O\))N
See this regex101 example.
Regards.

Python regular expressions non-greedy match

I have this code:
import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>.*?text.*?</b>', a).group())
and I am trying to match a minimal block between <b> and </b> which contains 'text' anywhere in between. This code is the best I could come up with, but it matches:
<b>1234</b><b>56text78</b>
while I need:
<b>56text78</b>
Instead of .*? use [^<]*:
print(re.search(r'<b>[^<]*text[^<]*</b>', a).group())
Here [^<]* matches any characters except "<", so the match cannot run across a tag.
Why are you getting <b>1234</b><b>56text78</b> as the output when using the <b>.*?text.*?</b> regex?
The regex engine scans the input from left to right. It first takes <b> from the pattern and tries to match it against the input string; as soon as it finds the first <b> tag, it matches that tag. The engine then takes the next part of the pattern, .*?text, and matches as few characters as possible up to the first text string. I say "first" because, if there is more than one text string after <b>, .*?text only matches up to the first one. So <b>1234</b><b>56text is matched. Finally the engine takes the last part, .*?</b>, and matches up to the first </b>, so <b>1234</b><b>56text78</b> is matched.
With the <b>[^<]*text[^<]*</b> regex, the characters between <b> and text, and between text and </b>, can be anything except the < character. This prevents the engine from matching across tags.
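The difference shows up clearly with re.findall() on a string that has more than one block containing text. This is only a sketch, and the sample string is a made-up variation of the one in the question.

```python
import re

# Hypothetical input: two of the three <b> blocks contain "text".
sample = '<b>1234</b><b>56text78</b><b>text9</b>'

lazy = re.findall(r'<b>.*?text.*?</b>', sample)
negated = re.findall(r'<b>[^<]*text[^<]*</b>', sample)

print(lazy)     # ['<b>1234</b><b>56text78</b>', '<b>text9</b>']
print(negated)  # ['<b>56text78</b>', '<b>text9</b>']
```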
Why doesn't <b>.*?text produce the desired output?
This is what the regex engine does:
1. It takes the first character from the search pattern, which is <, and finds it in the string, then takes the second, then the third, until it matches <b>.
2. The next step takes the whole .*?text pattern and tries to find it in the string. That's because .*? without the text part would make no sense, as it would match 0 characters. It matches the 1234</b><b>56text part and adds it to the <b> found in step 1.
It actually does produce a non-greedy output, it's just non-obvious in this case. If the string was:
<b>1234</b><b>56text78text</b><b>9012</b>
then the greedy '<b>.*text' match would be:
<b>1234</b><b>56text78text
and the non-greedy one '<b>.*?text' would produce the one I was getting:
<b>1234</b><b>56text
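The two results above can be reproduced with a few lines of Python 3. This is a minimal sketch, and the variable names are illustrative.

```python
import re

s = '<b>1234</b><b>56text78text</b><b>9012</b>'

# Greedy: .* runs to the end of the string and backtracks to the LAST "text".
greedy = re.search(r'<b>.*text', s).group()
# Non-greedy: .*? expands one character at a time and stops at the FIRST "text".
lazy = re.search(r'<b>.*?text', s).group()

print(greedy)  # <b>1234</b><b>56text78text
print(lazy)    # <b>1234</b><b>56text
```

In both cases the match starts at the first <b>, because the engine always reports the leftmost position where a match is possible. The quantifier only decides where the match ends.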
So to answer the initial question, the correct solution is to exclude the '<' and '>' characters from the part of the match before text:
import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>[^<>]*text.*?</b>', a).group())
