How to get groupdict from multi-line string using regex

I tried to get a dictionary from a multi-line string using regex, but I have a problem separating the lines properly.
Here is what I have tried...
import re
text = '''\n\n\nName: Clash1\nDistance: -1.274m\nImage Location: navis_raport_txt_files\\cd000001.jpg\nHardStatus: New\nClash Point: 1585.236m, 193.413m'''
clash_data = re.compile('''
(?P<clash_number>Clash\d+)\n
(?P<clash_depth>\d.\d{3})\n
(?P<image_location>cd\d+.jpg)\n
(?P<clash_status>\w{2:})\n
(?P<clash_point>.*)\n
(?P<clash_grid>\w+-\d+)\n
(?P<clash_date>.*)''', re.I | re.VERBOSE)
print(clash_data.search(text).groupdict())
This similar example works well:
import re
MHP = ['''MHP-PW-K_SZ-117-R01-UZ-01 - drawing title 123''',
'MHP-PW-K_SZ-127-R01WIP - drawing title 2',
'MHP-PW-K_SZ-107-R03-UZ-1 - drawing title 3']
fields_from_name = re.compile('''
(?P<object>\w{3})[-_]
(?P<phase>\w{2})[-_]
(?P<field>\w)[-_]
(?P<type>\w{2})[-_]
(?P<dr_number>\d{3})[-_]
[-_]?
(?P<revision>\w\d{2})?
(?P<wip_status>WIP)?
[-_]?
(?P<suplement>UZ-\d+)?
[\s-]+
(?P<drawing_title>.*)
''', re.IGNORECASE | re.VERBOSE)
for name in MHP:
    print(fields_from_name.search(name).groupdict())
Why doesn't my attempt work like the example?

It does not work simply because Pattern.search() finds no match and returns None. As in the working example you are mimicking, the pattern must also match the characters between the named capture groups you want in your output dict, so that the pattern as a whole matches.
The following example uses .*\n.* as a brute-force way to bridge the gap between capture groups: it matches any non-newline characters after one capture group, then the newline, then any non-newline characters that precede the next capture group. You will probably want something more precise than this, but it demonstrates the issue. I only included your first three groups because I could not follow what you intended with the regex in your <clash_status> group.
import re
text = '\n\n\nName: Clash1\nDistance: -1.274m\nImage Location: navis_raport_txt_files\\cd000001.jpg\nHardStatus: New\nClash Point: 1585.236m, 193.413m'
clash_data = re.compile(r'(?P<clash_number>Clash\d+).*\n.*'
r'(?P<clash_depth>\d.\d{3}).*\n.*'
r'(?P<image_location>cd\d+.jpg)', re.I | re.VERBOSE)
result = clash_data.search(text).groupdict()
print(result)
# OUTPUT
# {'clash_number': 'Clash1', 'clash_depth': '1.274', 'image_location': 'cd000001.jpg'}
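For completeness, here is a minimal sketch of a more precise pattern that covers every field present in the sample text. It assumes that each record uses exactly the labels seen in the question's text (Name, Distance, Image Location, HardStatus, Clash Point) in that order. The clash_grid and clash_date groups are left out because the sample text has no lines for them.

```python
import re

text = ('\n\n\nName: Clash1\nDistance: -1.274m\n'
        'Image Location: navis_raport_txt_files\\cd000001.jpg\n'
        'HardStatus: New\nClash Point: 1585.236m, 193.413m')

# In VERBOSE mode unescaped whitespace in the pattern is ignored,
# so literal spaces in the labels are written as "\ ".
clash_data = re.compile(r'''
    Name:\s*(?P<clash_number>Clash\d+)\n
    Distance:\s*(?P<clash_depth>-?\d+\.\d+)m\n
    Image\ Location:\s*.*?(?P<image_location>cd\d+\.jpg)\n
    HardStatus:\s*(?P<clash_status>\w+)\n
    Clash\ Point:\s*(?P<clash_point>.*)
''', re.IGNORECASE | re.VERBOSE)

match = clash_data.search(text)
# search() returns None when nothing matches, so guard before groupdict()
result = match.groupdict() if match else {}
print(result)
```

Matching the labels explicitly keeps each group tied to its own line, and the None check avoids an AttributeError on input that does not follow the layout.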

Related

See which component in regex alternation was captured

In regex alternation, is there a way to retrieve which alternation was matched? I just need the first alternation match, not all the alternations that match.
For example, I have a regex like this
pattern = r'(abc.*def|mno.*pqr|mno.*pqrt|.....)'
string = 'mnoxxxpqrt'
I want the output to be 'mno.*pqr'
How should I write the regex statement? Python language is preferred.

To do this efficiently without any iterations, you can put your desired sub-patterns in a list and join them into one alternation pattern with each sub-pattern enclosed in a capture group (so the resulting pattern looks like (abc.*def)|(mno.*pqr) instead of (abc.*def|mno.*pqr)). You can then obtain the group number of the sub-pattern with the Match object's lastindex attribute and in turn obtain the matching sub-pattern from the original list of sub-patterns:
import re
patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
pattern = '|'.join(map('({})'.format, patterns))
string = 'mno_foobar_pqrt'
print(pattern)
print(patterns[re.search(pattern, string).lastindex - 1])
This outputs:
(abc.*def)|(mno.*pqr)|(mno.*pqrt)
mno.*pqr
Demo: https://replit.com/#blhsing/JointBruisedMention
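A variation on the same idea, sketched here with named groups and the Match object's lastgroup attribute, avoids the index arithmetic. The group names (abc_def and so on) are made up for this sketch; any valid identifiers would do.

```python
import re

# Hypothetical labels for each alternative; the keys become group names.
alternatives = {
    'abc_def': r'abc.*def',
    'mno_pqr': r'mno.*pqr',
    'mno_pqrt': r'mno.*pqrt',
}
combined = '|'.join(f'(?P<{name}>{sub})' for name, sub in alternatives.items())

match = re.search(combined, 'mno_foobar_pqrt')
# lastgroup is the name of the last matched capturing group, or None
matched_name = match.lastgroup if match else None
matched_pattern = alternatives.get(matched_name)
print(matched_name, matched_pattern)
```

Because only one top-level alternative can take part in a match, lastgroup names the alternative that matched first.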

You can use capture groups:
import re
string = 'abcxxxdef'
patterns = ['abc.*def', 'mno.*pqr']
match = re.match(r'((abc.*def)|(mno.*pqr))',string)
groups = match.groups()
alternations = []
for i in range(1, len(groups)):
    if groups[i] is not None:
        pattern = patterns[i-1]
        break
print(pattern)
Result: abc.*def
Expressions inside round brackets are capture groups; they correspond to indexes 1 through the last of the result. Index 0 is the whole match.
You then need to find the index of the group that matched. Note that your patterns need to be defined beforehand.

Well you could iterate the terms in the regex alternation:
import re
string = 'abcxxxdef'
pattern = r'(abc.*def|mno.*pqr)'
terms = pattern[1:-1].split("|")
for term in terms:
    if re.search(term, string):
        print("MATCH => " + term)
This prints:
MATCH => abc.*def

The right answer to the question "How should I write the regex statement?" is actually:
There is no known way, using the regex pattern as provided, to extract from the search result which of the alternatives triggered the match.
Because the given pattern cannot provide that information, the pattern itself has to be changed so that the requested information can be extracted from the match.
One way around this regex engine limitation is shown below, but it requires an additional regex search and can fail for some special alternatives.
The code below allows simpler regex patterns without defining groups and works the other way around: it checks which of the alternative patterns matches the string found by the entire regex:
import re
pattern = r'abc.*def|mno.*pqr|mno.*pqrt'
text = 'mnoxxxpqrt'
match = re.match(pattern,text)[0]
print(next(p for p in pattern.split('|') if re.match(p, match)))
It can fail when the matched string, taken on its own, is no longer a match for the individual pattern. This can happen, for example, if the pattern uses a non-word-boundary \B assertion (as mentioned in the comments by Kelly Bundy).
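To illustrate that failure mode, here is a small sketch with a made-up pattern and text. It uses re.search rather than re.match for the first step so that \B can succeed in the middle of the text.

```python
import re

pattern = r'\Bfoo|bar'   # hypothetical alternatives for this sketch
text = 'xfoo'

# In the full text, \B succeeds between 'x' and 'f' (both word characters).
found = re.search(pattern, text)[0]

# Re-checking against the extracted string alone fails: 'foo' now starts
# at a word boundary, so \Bfoo no longer matches it.
rechecked = [p for p in pattern.split('|') if re.match(p, found)]
print(found, rechecked)
```

With an empty list, a next(...) call over the same generator would raise StopIteration, which is why the surrounding context of the match matters.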
A solution that does not fail is to search with a modified regex pattern. Below is an approach that uses a dictionary to define the alternatives and a function that returns the matched group:
import re
dct_alts = {1:r'(abc.*def)',2:r'(mno.*pqr)',3:r'(mno.*pqrt)'}
# ^-- the dictionary index is the index of the matching group in the found match.
text = 'mnoxxxpqrt'
def get_matched_group(dct_alts, text):
    pattern = '|'.join(dct_alts.values())
    re_match = re.match(pattern, text)
    return dct_alts[re_match.lastindex]
print(get_matched_group(dct_alts, text))
prints
(mno.*pqr)
For the sake of completeness, here is a function that returns a list of all the alternatives that match (not only the first one):
import re
lst_alts = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
text = 'mnoxxxpqrt'
def get_all_matched_groups(lst_alts, text):
    matches = []
    for pattern in lst_alts:
        re_match = re.match(pattern, text)
        if re_match:
            matches.append(pattern)
    return matches
print(get_all_matched_groups(lst_alts, text))
prints
['mno.*pqr', 'mno.*pqrt']

Regex string between square brackets only if '.' is within string

I'm trying to detect the text between two square brackets in Python; however, I only want the result where there is a "." within it.
I currently have \[(.*?)\] as my regex, using the following example:
String To Search:
CASE[Data Source].[Week] = 'THIS WEEK'
Result:
Data Source, Week
However I need the whole string as [Data Source].[Week], (square brackets included, only if there is a '.' in the middle of the string). There could also be multiple instances where it matches.

You can write a pattern that matches [...] and then repeats, one or more times, a . followed by another [...]:
\[[^][]*](?:\.\[[^][]*])+
Explanation
\[[^][]*] Match from [...] using a negated character class
(?: Non capture group to repeat as a whole part
\.\[[^][]*] Match a dot and again [...]
)+ Close the non capture group and repeat 1+ times
To get multiple matches, you can use re.findall
import re
pattern = r"\[[^][]*](?:\.\[[^][]*])+"
s = ("CASE[Data Source].[Week] = 'THIS WEEK'\n"
"CASE[Data Source].[Week] = 'THIS WEEK'")
print(re.findall(pattern, s))
Output
['[Data Source].[Week]', '[Data Source].[Week]']
If you also want the values between square brackets when there is no dot, you can use an alternation with lookaround assertions:
\[[^][]*](?:\.\[[^][]*])+|(?<=\[)[^][]*(?=])
Explanation
\[[^][]*](?:\.\[[^][]*])+ The same as the previous pattern
| Or
(?<=\[)[^][]*(?=]) Match the text inside [...], asserting [ to the left and ] to the right
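As a quick sketch of how the alternation behaves with re.findall, the input below is made up for illustration: it adds a standalone [Other] to the question's example.

```python
import re

pattern = r"\[[^][]*](?:\.\[[^][]*])+|(?<=\[)[^][]*(?=])"
s = "CASE[Data Source].[Week] = [Other]"  # hypothetical input

# The dotted form keeps its brackets; the lone bracket yields only its content.
matches = re.findall(pattern, s)
print(matches)
```

The first alternative is tried first at each position, so dotted chains are returned whole and their inner parts are not reported a second time.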

I think an alternative approach could be:
import re
sss = "CASE[Data Source].[Week] = 'THIS WEEK'"  # sample string from the question
pattern = re.compile(r"(\[[^\]]*\]\.\[[^\]]*\])")
print(pattern.findall(sss))
OUTPUT
['[Data Source].[Week]']

Where is such a regex wrong?

I am using python.
The pattern is:
re.compile(r'^(.+?)-?.*?\(.+?\)')
The text looks like:
text1 = 'TVTP-S2(xxxx123123)'
text2 = 'TVTP(xxxx123123)'
I expect to get TVTP

Another option to match those formats is:
^([^-()]+)(?:-[^()]*)?\([^()]*\)
Explanation
^ Start of string
([^-()]+) Capture group 1, match 1+ times any character other than - ( and )
(?:-[^()]*)? As the - is excluded from the first part, optionally match - followed by any char other than ( and )
\([^()]*\) Match from ( till ) without matching any parenthesis between them
Example
import re
regex = r"^([^-()]+)(?:-[^()]*)?\([^()]*\)"
s = ("TVTP-S2(xxxx123123)\n"
"TVTP(xxxx123123)\n")
print(re.findall(regex, s, re.MULTILINE))
Output
['TVTP', 'TVTP']

This regex works:
pattern = r'^([^-]+).*\(.+?\)'
>>> re.findall(pattern, 'TVTP-S2(xxxx123123)')
['TVTP']
>>> re.findall(pattern, 'TVTP(xxxx123123)')
['TVTP']

A quick answer would be:
^(\w+)(-.*?)?\((.*?)\)$
https://regex101.com/r/wL4jKe/2/

It is because the first plus is lazy, and the subsequent dash is optional, followed by a pattern that allows any character.
This allows the regex engine to choose the single letter T for the first group (because it is lazy), choose to interpret the dash as just not being there, which is allowed because it is followed by a question mark, and then have the next .* match "VTP-S2".
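You can just capture the leading characters that are neither dashes nor parentheses, followed by non-parenthesis characters up to the opening parenthesis. The capture group must not be lazy here: because [^(]* can consume everything up to the parenthesis, a lazy group would capture the empty string.
import re
p=re.compile(r'^([^-(]+)[^(]*\(.+?\)')
p.search('TVTP-S2(xxxx123123) blah()').group(1)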
The nonparentheses part prevents the second portion from matching 'S2(xxxx123123) blah(' in my modified example above.
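The effect described above can be checked directly. This sketch compares the question's pattern with a greedy variant whose capture group excludes dashes and parentheses; the variant is one possible fix, not the only one.

```python
import re

original = re.compile(r'^(.+?)-?.*?\(.+?\)')
# Greedy group limited to characters that are neither '-' nor '('
fixed = re.compile(r'^([^-(]+)[^(]*\(.+?\)')

results = []
for text in ('TVTP-S2(xxxx123123)', 'TVTP(xxxx123123)'):
    results.append((original.search(text).group(1),
                    fixed.search(text).group(1)))
print(results)
```

The lazy .+? in the original stops after a single character because the rest of the pattern can already match from there.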

Regex to fix (all the matches or none) at the end to one

I'm trying to normalize the .'s at the end of a string to exactly one. For example,
line = "python...is...fun..."
I have the regex \.*$ in Ruby, whose match is to be replaced by a single ., as in this demo, but it doesn't seem to work as expected. I've searched for similar posts, and the closest I got is this answer in Python, which suggests the following:
>>> text1 = 'python...is...fun...'
>>> new_text = re.sub(r"\.+$", ".", text1)
>>> new_text
'python...is...fun.'
But it fails if there is no . at the end. So I tried \b\.*$, as seen here, but this fails on the third test, which has some ?'s at the end.
My question is: why does \.*$ not match all the .'s (despite being greedy), and how do I solve the problem correctly?
Expected output:
python...is...fun.
python...is...fun.
python...is...fun??.

You can use an alternation that either matches 2 or more dots, or asserts that what is directly to the left is not a dot (the lookbehind can be extended to other characters, for example ! or ?).
In the replacement use a single dot.
(?:\.{2,}|(?<!\.))$
Explanation
(?: Non capture group for the alternation
\.{2,} Match 2 or more dots
| Or
(?<!\.) Get the position where directly to the left is not a . (which you can extend with other characters as desired)
) Close non capture group
$ End of string (Or use \Z if there can be no newline following)
For example
import re
strings = [
"python...is...fun...",
"python...is...fun",
"python...is...fun??"
]
for s in strings:
    new_text = re.sub(r"(?:\.{2,}|(?<!\.))$", ".", s)
    print(new_text)
Output
python...is...fun.
python...is...fun.
python...is...fun??.
If an empty string should not be replaced by a dot, you can use a positive lookbehind.
(?:\.{2,}|(?<=[^\s.]))$
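As for why \.*$ leaves two dots: the pattern does match all the trailing dots, but it can then match a second time, as an empty string at the very end, and that empty match is replaced too. The sketch below shows this in Python (assuming Python 3.7 or later, where re.sub also replaces an empty match adjacent to a previous match). Limiting the substitution with count=1 is an alternative suggested here, not part of the pattern above.

```python
import re

strings = [
    "python...is...fun...",
    "python...is...fun",
    "python...is...fun??",
]

# Two replacements happen: one for "..." and one for the empty match at the end.
double = re.sub(r"\.*$", ".", strings[0])

# With count=1 only the first (leftmost) match is replaced.
fixed = [re.sub(r"\.*$", ".", s, count=1) for s in strings]
print(double)
print(fixed)
```

Like the first pattern above, the count=1 variant turns an empty string into a single dot.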

non greedy Python regex from end of string

I need to search a string in Python 3 and I'm having trouble implementing non-greedy logic starting from the end.
I'll try to explain with an example:
Input can be one of the following
test1 = 'AB_x-y-z_XX1234567890_84481.xml'
test2 = 'x-y-z_XX1234567890_84481.xml'
test3 = 'XX1234567890_84481.xml'
I need to find the last part of the string ending with
somestring_otherstring.xml
In all the above cases the regex should return XX1234567890_84481.xml
My best try is:
result = re.search('(_.+)?\.xml$', test1, re.I).group()
print(result)
Here I used:
(_.+)? to match "_anystring" in a non greedy mode
\.xml$ to match ".xml" in the final part of the string
The output I get is not correct:
_x-y-z_XX1234567890_84481.xml
I found some SO questions explaining that the regex starts from the left even with a non-greedy qualifier.
Could anyone explain to me how to implement a non-greedy regex from the right?

Your pattern (_.+)?\.xml$ captures, in an optional group, everything from the first underscore up to the .xml at the end of the string; it does not take into account how many underscores should lie in between.
To match only the last part, you can omit the capturing group and use a negated character class, with the anchor $ asserting the end of the string:
[^_]+_[^_]+\.xml$
That will match
[^_]+ Match 1+ times not _
_ Match literally
[^_]+ Match 1+ times not _
\.xml$ Match .xml at the end of the string
For example:
import re
test1 = 'AB_x-y-z_XX1234567890_84481.xml'
result = re.search(r'[^_]+_[^_]+\.xml$', test1, re.I)
if result:
    print(result.group())
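To see why a lazy quantifier alone does not help, the sketch below runs a lazy variant of the question's pattern next to the negated character class. The regex engine always reports the leftmost match, so laziness only shortens a match from its right end. The rsplit version is a non-regex alternative added here for comparison; it assumes the wanted part is always the last two underscore-separated fields.

```python
import re

tests = [
    'AB_x-y-z_XX1234567890_84481.xml',
    'x-y-z_XX1234567890_84481.xml',
    'XX1234567890_84481.xml',
]

# Lazy, but still starts at the first underscore in the string.
lazy = re.search(r'_.+?\.xml$', tests[0]).group()

# Negated character classes cannot cross an underscore.
anchored = [re.search(r'[^_]+_[^_]+\.xml$', t).group() for t in tests]

# Non-regex alternative: split from the right, keep the last two fields.
split_based = ['_'.join(t.rsplit('_', 2)[-2:]) for t in tests]

print(lazy)
print(anchored)
print(split_based)
```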

Not sure if this matches what you're looking for conceptually as "non greedy from the right" - but this pattern yields the correct answer:
'[^_]+_[^_]+\.xml$'
The [^_] is a character class matching any character which is not an underscore.

You need to use this regex to capture what you want,
[^_]*_[^_]*\.xml
Check out this Python code,
import re
arr = ['AB_x-y-z_XX1234567890_84481.xml','x-y-z_XX1234567890_84481.xml','XX1234567890_84481.xml']
for s in arr:
    m = re.search(r'[^_]*_[^_]*\.xml', s)
    if m:
        print(m.group(0))
Prints,
XX1234567890_84481.xml
XX1234567890_84481.xml
XX1234567890_84481.xml
The problem with your regex (_.+)?\.xml$ is that the (_.+)? part starts matching at the first _ and matches everything up to the literal .xml; the whole group is also optional because it is followed by ?. This is why, in AB_x-y-z_XX1234567890_84481.xml, the group captures _x-y-z_XX1234567890_84481, which is not the behavior you wanted.
