I have this code:
import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>.*?text.*?</b>', a).group())
and I am trying to match a minimal block between <b> and </b> which contains 'text' anywhere in between. This code is the best I could come up with, but it matches:
<b>1234</b><b>56text78</b>
while I need:
<b>56text78</b>
Instead of .*, use a negated character class:
print(re.search(r'<b>[^<]*text[^<]*</b>', a).group())
Here [^<] matches any character except <, so the match cannot run across a tag.
Why do you get <b>1234</b><b>56text78</b> when using the regex <b>.*?text.*?</b>?
The regex engine scans the input from left to right. It first takes the pattern <b> and tries to match it against the input; as soon as it finds the first <b> tag, it matches it. Next the engine takes .*?text, which matches any characters up to the first occurrence of text. I say "first" because if there were several occurrences of text after <b>, .*?text would stop at the first one. Since . also matches < and >, this consumes <b>1234</b><b>56text. Finally the engine takes .*?</b> and matches up to the first </b> that follows, so the overall match is <b>1234</b><b>56text78</b>.
With the regex <b>[^<]*text[^<]*</b>, the characters between <b> and text, and between text and </b>, may be anything except <. That prevents the engine from matching across tags.
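A minimal sketch comparing the two patterns on the string from the question (Python 3 syntax; the variable names lazy and negated are only for illustration):

```python
import re

a = r'<b>1234</b><b>56text78</b><b>9012</b>'

# Lazy dot: the match still starts at the first <b>, because . may cross tags.
lazy = re.search(r'<b>.*?text.*?</b>', a).group()

# Negated class: [^<] cannot cross a tag, so the first <b> is abandoned
# and the engine retries from the second <b>.
negated = re.search(r'<b>[^<]*text[^<]*</b>', a).group()

print(lazy)     # <b>1234</b><b>56text78</b>
print(negated)  # <b>56text78</b>
```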
Why doesn't <b>.*?text produce the desired output?
This is what the regex engine does:
1. It takes the first character of the search pattern, which is <, and finds it in the string, then takes the second, then the third, until it has matched <b>.
2. It then takes the whole .*?text pattern and tries to find it in the string. It treats the pattern as a unit because .*? without the text part would make no sense: being non-greedy, it would match 0 characters. It matches the 1234</b><b>56text part and appends it to the <b> found in step 1.
It actually does produce a non-greedy match; it is just not obvious in this case. If the string were:
<b>1234</b><b>56text78text</b><b>9012</b>
then the greedy '<b>.*text' match would be:
<b>1234</b><b>56text78text
and the non-greedy one '<b>.*?text' would produce the one I was getting:
<b>1234</b><b>56text
So to answer the initial question, the correct solution is to exclude the '<' and '>' characters from the search:
import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>[^<>]*text.*?</b>', a).group())
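To make the greedy versus non-greedy difference visible, here is a small sketch using the longer example string with two occurrences of text (Python 3 syntax; the variable names are illustrative):

```python
import re

b = '<b>1234</b><b>56text78text</b><b>9012</b>'

# Greedy: .* runs to the end and backtracks to the LAST "text".
greedy = re.search(r'<b>.*text', b).group()

# Non-greedy: .*? grows one character at a time and stops at the FIRST "text".
lazy = re.search(r'<b>.*?text', b).group()

# Excluding < and > forces the match to start at the <b> that owns "text".
fixed = re.search(r'<b>[^<>]*text.*?</b>', b).group()

print(greedy)  # <b>1234</b><b>56text78text
print(lazy)    # <b>1234</b><b>56text
print(fixed)   # <b>56text78text</b>
```

Both the greedy and the non-greedy match begin at the first <b>; only the end point differs.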
Related
I want to match text only when it is not followed by a certain pattern. (I think a negative lookahead can be used here. I tried "(.*?)(?![A-Z]+:)" but did not get the result I wanted.)
Example:
Paragraph 1: "This is a simple text. INTRODUCTION: Intro is the start of a paragraph"
Paragraph 2: "This is a simple text"
Expected output: the regex should not match paragraph 1, but should match paragraph 2, which does not contain the pattern "[A-Z]+:".
Any help is appreciated.
You can put the [A-Z]: part inside a negative lookahead, anchored at the start of the string, to assert that it does not occur anywhere to the right.
You can omit the + after [A-Z], and since you only need the match itself, you can also omit the capturing group.
Use .+ so that an empty string is not matched (note that the dot does match a space).
^(?!.*[A-Z]:).+
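A short sketch of how this pattern behaves on the two paragraphs from the question (Python 3; the names p1, p2, m1 and m2 are illustrative):

```python
import re

# Pattern from the answer: reject any string containing "<capital letter>:".
pattern = re.compile(r'^(?!.*[A-Z]:).+')

p1 = "This is a simple text. INTRODUCTION: Intro is the start of a paragraph"
p2 = "This is a simple text"

m1 = pattern.search(p1)  # lookahead finds "N:" so the match is rejected
m2 = pattern.search(p2)  # no "[A-Z]:" anywhere, so the whole line matches

print(m1)          # None
print(m2.group())  # This is a simple text
```

If each paragraph is a line inside one larger string, passing re.MULTILINE would let ^ apply per line.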
I would like to match either:
the first element of a multipart rar archive, i.e. a name matching the regex (.*.)part0*1.rar
or
a single-part rar archive, i.e. a name that does not match ^.*(part\d+).rar$
I use this regex:
regex = r"(.*)(?:part0*1|.*[^(part\d+)])\.rar"
I have some issues:
apps.rar matches, but apps2.rar does not match and should
LA460.6.7.rar does not match and should
apps.rar should match with group(1)="apps", not group(1)="app"
You can check the snippet on regex101.
Could you find the error in the regex?
Thanks
The reason you sometimes lose the last character is that the pattern you tried, (.*)(?:part0*1|.*[^(part\d+)])\.rar, first captures the whole line in capture group 1 and then has to backtrack.
The capture group is followed by an alternation matching either part0*1 or .*[^(part\d+)].
That is why the lines ending in part followed by 1 (with optional leading zeros) are matched.
But when there is no match for part0*1, the next alternative, .*[^(part\d+)], is tried.
Its .* can match nothing, but [^(part\d+)] must match exactly one character: the square brackets make it a negated character class (any character except (, p, a, r, t, a digit, + or )), not a negated group, and it has no quantifier. Group 1 therefore has to give up its last character, and a name whose last character before .rar is a digit cannot match at all.
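The behaviour described above can be reproduced with a few lines of Python. This is only a sketch; the test string "part2(x)+y" is made up to show which characters the bracket expression rejects:

```python
import re

original = re.compile(r"(.*)(?:part0*1|.*[^(part\d+)])\.rar")

# [^(part\d+)] consumes exactly one character, so group 1 loses its last one.
m = original.search("apps.rar")
print(m.group(1))  # app

# '2' is a digit, which the class excludes, so nothing can precede ".rar".
no_match = original.search("apps2.rar")
print(no_match)  # None

# On its own, the bracket expression is just a single-character class.
singles = re.findall(r"[^(part\d+)]", "part2(x)+y")
print(singles)  # ['x', 'y']
```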
One option is to use a negative lookahead asserting that the string does not contain part followed by optional zeros and then either a digit 2-9 with optional further digits, or a digit 1-9 followed by one or more digits.
^(?!.*part0*(?:[2-9]\d*|[1-9]\d+)\.rar)(.+)\.rar$
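A quick check of this pattern against the file names from the question, as a sketch. The movie.* names are hypothetical and added only to exercise the multipart cases:

```python
import re

pattern = re.compile(r"^(?!.*part0*(?:[2-9]\d*|[1-9]\d+)\.rar)(.+)\.rar$")

# The movie.* names are hypothetical; the others come from the question.
names = ["apps.rar", "apps2.rar", "LA460.6.7.rar",
         "movie.part01.rar", "movie.part02.rar", "movie.part10.rar"]

results = {}
for name in names:
    m = pattern.match(name)
    # Store group 1 for a match, None when the lookahead rejects the name.
    results[name] = m.group(1) if m else None
    print(name, "->", results[name])
```

Only part numbers other than 1 are rejected; single-part names and part01 pass, and group 1 holds everything before the final .rar.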
You can search for filenames that either contain the word 'part' followed by 01 or 1, or do not contain the word 'part' at all.
Please try the regex below:
(.*part0?1|^(?!.*part.*).*)\.rar
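Here is a sketch of how this alternative behaves. The movie.* names are hypothetical. Note that 0? allows at most one leading zero, so a name like movie.part001.rar is not accepted, unlike with part0*1 from the question:

```python
import re

pattern = re.compile(r"(.*part0?1|^(?!.*part.*).*)\.rar")

# The movie.* names are hypothetical test inputs.
names = ["apps.rar", "movie.part01.rar", "movie.part02.rar", "movie.part001.rar"]

results = {}
for name in names:
    m = pattern.search(name)
    results[name] = m.group(1) if m else None
    print(name, "->", results[name])
```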
I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current regex is:
([0-9a-zA-Z]{1,2}\.)
Only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.
You can use the following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Here is the explanation:
\b - Matches a word boundary, which avoids matching part of a longer word such as examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - Matches one or more groups, each consisting of one or two alphanumeric characters followed by a literal dot, as in your own regex.
[0-9a-zA-Z] - The match then ends with a single alphanumeric character. If you want to allow one or two alphanumeric characters at the end as well, add {1,2} after it.
\b - Matches a word boundary again, so the match cannot end in the middle of a longer word.
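A small sketch comparing this regex with the one from the question on the sample sentence (Python 3; the variable names are illustrative):

```python
import re

s = 'Refer to paragraph C.2.1a.5 for examples.'

# Regex from this answer: the whole dotted identifier is one match.
broad = re.findall(r'\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b', s)
print(broad)  # ['C.2.1a.5']

# Regex from the question, for comparison: each "xx." chunk is its own match.
chunks = re.findall(r'([0-9a-zA-Z]{1,2}\.)', s)
print(chunks)  # ['C.', '2.', '1a.', 'es.']
```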
EDIT:
As someone pointed out, your text may contain strings like A.A.A.A.A.A. or A.A.A or even 1.2. If you don't want to match those and only want strings with exactly three dots, use the following, more specific regex:
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This regex matches only paragraph numbers with exactly three dots; the negative lookbehind and lookahead ensure it does not match part of a longer string such as A.A.A.A.A.A.
Here is a Python sample:
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output,
['C.2.1a.5']
As for ^ and $: they are the start and end anchors. If you use them, the regex must match from the start of the line to the end of the line, which is not what you want here. That is why using them returned no matches.
If a simpler version is enough, you can use this regex, which is easy to understand and modify: ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})
I think we should keep the regex simple and readable.
You can use the regex
(?:[a-zA-Z0-9]+\.){3}[a-zA-Z0-9]+
Explanation:
The expression (?:[a-zA-Z0-9]+\.){3} requires the group (?:[a-zA-Z0-9]+\.) to be repeated 3 times. The group matches one or more alphanumeric characters followed by a dot. The class includes digits so that the 2 and 5 in C.2.1a.5 can match.
The match ends with one or more alphanumeric characters.
Output:
['C.2.1a.5']
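A quick check in Python, as a sketch. It assumes the character class includes digits ([a-zA-Z0-9]), since a letters-only class cannot match the 2 and 5 in C.2.1a.5:

```python
import re

s = 'Refer to paragraph C.2.1a.5 for examples.'

# Alphanumeric class: three "chunk." groups followed by a final chunk.
with_digits = re.findall(r'(?:[a-zA-Z0-9]+\.){3}[a-zA-Z0-9]+', s)
print(with_digits)  # ['C.2.1a.5']

# Letters-only class, for comparison: the digits break every candidate match.
letters_only = re.findall(r'(?:[a-zA-Z]+\.){3}[a-zA-Z]+', s)
print(letters_only)  # []
```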
I am attempting to extract some raw strings using the re module in Python. The end of a to-be-extracted section is marked by a word that occurs multiple times in the text. My current attempts always capture up to the last occurrence of that word. How can I modify this behavior?
A text file has been extracted from a PDF. The entire PDF is stored as one string. The general format of the string is:
"***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"
The string I intend to capture is: "Collection of alphanumeric words and characters"
The attempted solution was:
re.compile(r"*{3}Start of notes:(.+)\sEndofsection")
This attempt tends to match the whole string rather than just "Collection of alphanumeric words and characters" as intended.
One possible approach is to split on Endofsection and extract the string from the first section only. This works, but I was hoping to find a more elegant solution using re.compile.
There are two problems in your regex:
First, * is a metacharacter, so you need to escape it as \*.
Second, (.+) uses a greedy quantifier, which matches as much as possible. Since you want the shortest match, change it to (.+?).
Fixing these two issues gives you the intended match.
Python code:
import re
s = "***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"
m = re.search(r'\*{3}Start of notes:(.+?)\sEndofsection', s)
if m:
    print(m.group(1))
Prints,
Collection of alphanumeric words and characters
Given the following Python script:
import re
text = '<?xml version="1.24" encoding="utf-8">'
mu = (".??[?]?[?]", "....")
for item in mu:
    print(item, ":", re.search(item, text).group())
Can someone please explain why the first regex, .??[?]?[?], returns <? instead of just ??
My explanation:
.?? should match nothing, as .? matches one character or none and the second ? makes it non-greedy.
[?]? can match ? or nothing, so matching nothing is fine here, too.
[?] just matches ?
That should result in ? and not in <?
For the same reason that o*?bar matches oobar in foobar. Even if a quantifier is non-greedy, the regex engine tries every possible way to match from the current starting character before moving on to the next one.
First .?? matches the empty string, but then the rest of the pattern fails at <. The engine backtracks, lets .?? match <, and the rest of the regex then matches, without moving the start position of the match to the next character.
Regex "greediness" only affects the order in which alternatives are tried during backtracking; it doesn't mean that the engine will skip earlier potential match points. A regex always takes the leftmost possible match. In this case that is <?, because it starts farther to the left than ?.
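The leftmost-match rule can be checked directly. In this sketch the last call uses the pos argument of a compiled pattern's match() to force the attempt to start at index 1, which is an illustration added here and not part of the original script:

```python
import re

text = '<?xml version="1.24" encoding="utf-8">'

# Lazy .?? first tries to match nothing, fails, then expands to take "<".
from_start = re.search(r'.??[?]?[?]', text).group()
print(from_start)  # <?

# Same effect with o*?: the match starts at the first position that can work.
foobar = re.search(r'o*?bar', 'foobar').group()
print(foobar)  # oobar

# Starting the attempt at index 1 gives the "?" the asker expected.
from_index_1 = re.compile(r'.??[?]?[?]').match(text, 1).group()
print(from_index_1)  # ?
```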