Capture text blocks with a regular expression - python

I have structured documents in the following format:
123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|DNR Order Verification:Scanned|
xyz pqs 123
[report_end]
123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|A Note|
xyz pqs 123
[report_end]
Where each record:
starts with an 11-field line delimited by |
has an intervening block of free text
ends with the tag "[report_end]"
How can I capture these three elements with a regular expression?
My approach would be to
search each line that has 11 | characters;
search each line that has [report_end];
search whatever is in between these two lines.
But I don't know how to accomplish this with regular expressions.
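The line-by-line approach described above can also be written without a regular expression. The following is a minimal sketch, assuming a header is any line containing exactly 11 | characters and that [report_end] sits on a line of its own; the function name parse_records is hypothetical.

```python
def parse_records(lines):
    """Group lines into (fields, text) records."""
    records = []
    fields, body = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if fields is None:
            # Outside a record: look for a header with exactly 11 pipes.
            if line.count("|") == 11:
                fields = line.split("|")[:-1]  # drop the empty trailing item
        elif line.strip() == "[report_end]":
            # End tag closes the current record.
            records.append((fields, "\n".join(body)))
            fields, body = None, []
        else:
            # Anything between header and end tag is free text.
            body.append(line)
    return records


sample = """\
123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|DNR Order Verification:Scanned|
xyz pqs 123
[report_end]
123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|A Note|
xyz pqs 123
[report_end]
"""

records = parse_records(sample.splitlines())
for fields, text in records:
    print(len(fields), repr(text))
```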

You can also try with:
^(?P<fields>(?:[^|]+\|){11})(?P<text>[\s\S]+?)(?P<end>\[report_end\])
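A minimal sketch of how this pattern might be used from Python. It assumes the re.MULTILINE flag, so that ^ matches at the start of every record rather than only at the start of the text; the variable names are illustrative.

```python
import re

RECORD = re.compile(
    r"^(?P<fields>(?:[^|]+\|){11})(?P<text>[\s\S]+?)(?P<end>\[report_end\])",
    re.MULTILINE,
)

sample = (
    "123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|"
    "LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|DNR Order Verification:Scanned|\n"
    "xyz pqs 123\n"
    "[report_end]\n"
    "123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|"
    "LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|A Note|\n"
    "xyz pqs 123\n"
    "[report_end]\n"
)

records = []
for m in RECORD.finditer(sample):
    # Split the header into its 11 fields and trim the free text.
    fields = m.group("fields").rstrip("|").split("|")
    records.append((fields, m.group("text").strip(), m.group("end")))

for fields, text, end in records:
    print(fields[-1], "|", text, "|", end)
```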

You can use something like:
r"((?:.*?\|){11}\s+(?:.*)\s+\[report_end\])"
OUTPUT:
Match 1. [0-157] `123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|DNR Order Verification:Scanned|
xyz pqs 123
[report_end]
Match 2. [159-292] `123456789|XXX|1234567|05/05/2012 00:00|81900153|Signed|LASTNAME,FIRSTNAME, M.S.|024813|XXX|3410080|A Note|
xyz pqs 123
[report_end]
DEMO
https://regex101.com/r/xY5nI9/1
Regex Explanation
((?:.*?\|){11}\s+(?:.*)\s+\[report_end\])
Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Regex syntax only
Match the regex below and capture its match into backreference number 1 «((?:.*?\|){11}\s+(?:.*)\s+\[report_end\])»
Match the regular expression below «(?:.*?\|){11}»
Exactly 11 times «{11}»
Match any single character that is NOT a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “|” literally «\|»
Match a single character that is a “whitespace character” «\s+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below «(?:.*)»
Match any single character that is NOT a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match a single character that is a “whitespace character” «\s+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “[” literally «\[»
Match the character string “report_end” literally «report_end»
Match the character “]” literally «\]»
UPDATE BASED ON YOUR COMMENTS
To get 3 groups you can use:
r"((?:.*?\|){11})\s+(.*)\s+(\[report_end\])"
To loop all groups:
import re
pattern = re.compile(r"((?:.*?\|){11})\s+(.*)\s+(\[report_end\])")
# string holds the full document text
for match1, match2, match3 in pattern.findall(string):
    print(match1 + "\n" + match2 + "\n" + match3 + "\n")
LIVE DEMO
http://ideone.com/k8sA3k

Related

Regular expression to make non-greedy

I have a text like this
EXPRESS blood| muscle| testis| normal| tumor| fetus| adult
RESTR_EXPR soft tissue/muscle tissue tumor
Right now I want to only extract the last item in EXPRESS line, which is adult.
My pattern is:
[|](.*?)\n
The code goes greedy to muscle| testis| normal| tumor| fetus| adult. Can I know if there is any way to solve this issue?
You can match a pipe char followed by optional spaces, then capture what follows with a negated character class that excludes pipe chars. The value you want is in capture group 1.
If there has to be a newline at the end of the string:
\|[^\S\n]*([^|\n]*)\n
Explanation
\| Match |
[^\S\n]* Match optional whitespace chars without newlines
( Capture group 1
[^|\n]* Match optional chars except for | or a newline
) Close group 1
\n Match a newline
Or asserting the end of the string:
\|[^\S\n]*([^|\n]*)$
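As a quick check, here is a small sketch that applies both patterns to the sample text from the question. The $ variant is compiled with re.MULTILINE here, which is an assumption on my part so that $ can match at the end of the EXPRESS line instead of only at the end of the whole text.

```python
import re

text = (
    "EXPRESS blood| muscle| testis| normal| tumor| fetus| adult\n"
    "RESTR_EXPR soft tissue/muscle tissue tumor\n"
)

# Variant 1: the value must be followed by a newline.
with_newline = re.search(r"\|[^\S\n]*([^|\n]*)\n", text)

# Variant 2: assert the end of the line instead of consuming the newline.
with_anchor = re.search(r"\|[^\S\n]*([^|\n]*)$", text, re.MULTILINE)

value_newline = with_newline.group(1) if with_newline else None
value_anchor = with_anchor.group(1) if with_anchor else None
print(value_newline, value_anchor)
```

Because the capture group cannot contain a pipe, every earlier pipe fails to reach the end of the line and only the last item is captured.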
You could use this one. It skips the leading space, handles the \r\n case, and is non-greedy:
\|\s*([^|]*?)\r?\n

Regex to match dollar amount with uppercase letter or word

I'm trying to match some sort of amount, here are all possibilities:
$5.6 million
$4,1 million
$8,1M
$6.3M
$333,333
$2 million
$5 million
I have already this regex:
\$\d{1,3}(?:,\d{3})*(?:\s+(?:thousand|[mb]illion|[MB]illion)|[M])?
See online demo.
But I'm not able to match those ones:
$5.6 million
$4,1 million
$8,1M
$6.3M
Any help would be appreciated.
Let's look at your regular expression:
\$\d{1,3}(?:,\d{3})*(?:\s+(?:thousand|[mb]illion|[MB]illion)|[M])?
\$\d{1,3} is fine. What follows? One way to answer that is to consider the following three possibilities.
The string to be matched ends ' million'
This string (which begins with a space, in case you missed that) is preceded by an empty string or a single digit preceded by a comma or period:
(?:[,.]\d)? million
Evidently, "million" can be "thousand" or "billion", and the first letter of "million" or "billion" might be capitalized, so we change the expression to
(?:[,.]\d)? (?:[MmBb]illion|thousand)
One potential problem is that this matches '$5.6 millionaire'. We can avoid that problem by tacking on a word boundary, which prevents the match from being followed by a word character:
(?:[,.]\d)? (?:[MmBb]illion|thousand)\b
The string ends 'M'
In this case the 'M' must be preceded by a single digit preceded by a comma or period:
[,.]\dM\b
You could accept 'B' as well by changing M to [MB].
The string ends with three digits preceded by a comma
Here we need
,\d{3}\b
Here the word boundary avoids matching, for example, '$333,3333'. It will not match, however, '$333,333,333' or '$333,333,333,333'. If we want to match those we could change the expression to
(?:,\d{3})+\b
or to match '$333' as well, change it to
(?:,\d{3})*\b
Construct the alternation
We therefore can use the following regular expression.
\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)\b|[,.]\dM\b|,\d{3}\b)
Factoring out the word boundary we obtain
\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)|[,.]\dM|,\d{3})\b
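The alternation built up in the steps above can be checked against the amounts from the question. This is only a sketch: the pattern is written with \b as the trailing word boundary, and the names AMOUNT and samples are illustrative.

```python
import re

# Three alternatives: "... million/billion/thousand", "...M", or ",ddd",
# all followed by a shared word boundary.
AMOUNT = re.compile(
    r"\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)|[,.]\dM|,\d{3})\b"
)

samples = [
    "$5.6 million",
    "$4,1 million",
    "$8,1M",
    "$6.3M",
    "$333,333",
    "$2 million",
    "$5 million",
]

matched = [s for s in samples if AMOUNT.fullmatch(s)]
rejected = AMOUNT.search("$5.6 millionaire")

print(matched)
print(rejected)
```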
You can use
(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?
If you need to make sure you do not match m that is part of another word:
(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?\b
See the regex demo. Details:
(?i) - case insensitive option
\$ - a $ char
\d+ - one or more digits
(?:[.,]\d+)* - zero or more repetitions of . or , and then one or more digits
(?:\s+(?:thousand|[mb]illion)|m)? - an optional occurrence of
\s+(?:thousand|[mb]illion) - one or more whitespaces and then thousand, million or billion
| - or
m - an m char
\b - a word boundary.
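For illustration, a short sketch that runs the case-insensitive pattern over running text with re.findall. The sentence in text is made up for the example; since the pattern has no capturing groups, findall returns the full matches.

```python
import re

AMOUNT = re.compile(
    r"(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?\b"
)

# Hypothetical sentence containing three of the amount styles.
text = "Revenue rose from $6.3M to $5.6 million, with $333,333 in fees."

amounts = AMOUNT.findall(text)
print(amounts)
```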

How to find repeated word in a string with regex

I have a string like codecodecodecodecode...... I need to find a repeated word in that string.
I found a way but the regular expression always returns half of the repeated part I want.
^(.*)\1+$
at the group(1) I want to see just "code"
If the quantifier is greedy, the group first matches to the end of the line and then backtracks until the backreference \1 can repeat one or more times up to the end of the string. For an evenly divisible string, such as 4 words, the group captures 2 words and \1 matches the same 2 words.
If you have 5 words, like codecodecodecodecode in your example, the group captures a single word, because the only split that reaches the end of the string is 1 captured word followed by 4 repetitions.
The quantifier should be non-greedy (and repeat 1+ times so it cannot match an empty string), so that the group matches as few characters as possible that can be repeated up to the end of the string.
^(.+?)\1+$
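The difference between the two quantifiers can be seen in a small sketch like the following; the helper name repeated_unit is hypothetical.

```python
import re

def repeated_unit(pattern, s):
    """Return group 1 of the match, or None if the string does not match."""
    m = re.match(pattern, s)
    return m.group(1) if m else None

four = "code" * 4
five = "code" * 5

greedy_four = repeated_unit(r"^(.*)\1+$", four)   # largest repeatable half
greedy_five = repeated_unit(r"^(.*)\1+$", five)   # only one split is possible
lazy_four = repeated_unit(r"^(.+?)\1+$", four)    # smallest repeatable unit
lazy_five = repeated_unit(r"^(.+?)\1+$", five)

print(greedy_four, greedy_five, lazy_four, lazy_five)
```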

Regex: Separating All Caps from Numbers

I am using python regex to read documents.
I have the following line in many documents:
Dated: February 4, 2011 THE REAL COMPANY, INC
I can use python text search to easily find the lines that have "dated," but I want to pull THE REAL COMPANY, INC from the text without getting the "February 4, 2011" text.
I have tried the following:
[A-Z\s]{3,}.*INC
My understanding of this regex is it should get me all capital letters and spaces before INC, but instead it pulls the full line.
This suggests to me I'm fundamentally missing something about how regex works with capital letters. Is there an easy and obvious explanation I'm missing?
What about using:
>>> import re
>>> txt
'Dated: February 4, 2011 THE REAL COMPANY, INC'
>>> re.findall('([A-Z][A-Z]+)', txt)
['THE', 'REAL', 'COMPANY', 'INC']
Another way, as suggested by @davedwards, is:
>>> re.findall('[A-Z\s]{3,}.*', txt)
[' THE REAL COMPANY, INC']
Explanation:
[A-Z\s]{3,}.*
Match a single character present in the list below [A-Z\s]{3,}
{3,} Quantifier — Matches between 3 and unlimited times, as many times as possible, giving back as needed (greedy)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
You could use
^Dated:.*?\s([A-Z ,]{3,})
And make use of the first capturing group, see a demo on regex101.com.
Your regex [A-Z\s]{3,}.*INC matches 3 or more uppercase or whitespace characters, followed by any character 0+ times, and then INC, which will match: THE REAL COMPANY, INC
What you could also do is match Dated: from the start of the string followed by a date like format and then capture what comes after in a group. Your value will be in the first capturing group:
^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$
Explanation
^Dated:\s+ Match Dated: followed by 1+ times a whitespace character
\S+\s+ Match 1+ times not a whitespace character followed by 1+ times a whitespace character, which will match February in this case
\d{1,2}, Match 1-2 times a digit followed by a comma
\s+\d{4}\s+ match 1+ times a whitespace character, 4 digits, followed by 1+ times a whitespace character
(.*) Capture in a group 0+ times any character
$ Assert the end of the string
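As a rough illustration, both capture-group patterns suggested above can be checked against the sample line from the question. The variable names are illustrative only.

```python
import re

txt = "Dated: February 4, 2011 THE REAL COMPANY, INC"

# Lazy skip up to the first run of 3+ capitals, spaces, or commas.
by_caps = re.search(r"^Dated:.*?\s([A-Z ,]{3,})", txt)

# Match the date explicitly, then capture the rest of the line.
by_date = re.search(r"^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$", txt)

company_by_caps = by_caps.group(1) if by_caps else None
company_by_date = by_date.group(1) if by_date else None
print(company_by_caps)
print(company_by_date)
```

The second pattern does not depend on the company name being in capitals, so it is the more tolerant of the two if that assumption does not hold for every document.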

extracting items using regular expression in python

I have a file which has the following:
new=['{"TES1":"=TES0"}}', '{"""TES1:IDD""": """=0x3C""", """TES1:VCC""": """=0x00"""}']
I am trying to extract the first item, TES1:=TES0, from the list. I am trying to use a regular expression to do this. This is what I tried, but I am not able to grab the second item, TES0.
import re
TES = re.compile(r'(TES[\d].)+')
for item in new:
    result = TES.search(item)
    print(result.groups())
The result of the print was ('TES1:',). I have tried various ways to extract it but am always getting the same result. Any suggestion or help is appreciated. Thanks!
I think you are looking for findall:
import re
TES = re.compile(r'TES[\d].')
for item in new:
    result = TES.findall(item)
    print(result)
First Option (with quotes)
To match "TES1":"=TES0", you can use this regex:
"TES\d+":"=TES\d+"
like this:
match = re.search(r'"TES\d+":"=TES\d+"', subject)
if match:
    result = match.group()
Second Option (without quotes)
If you want to get rid of the quotes, as in TES1:=TES0, you use this regex:
Search: "(TES\d+)":"(=TES\d+)"
Replace: \1:\2
like this:
result = re.sub(r'"(TES\d+)":"(=TES\d+)"', r"\1:\2", subject)
How does it work?
"(TES\d+)":"(=TES\d+)"
Match the character “"” literally "
Match the regex below and capture its match into backreference number 1 (TES\d+)
Match the character string “TES” literally (case sensitive) TES
Match a single character that is a “digit” (0–9 in any Unicode script) \d+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
Match the character string “":"” literally ":"
Match the regex below and capture its match into backreference number 2 (=TES\d+)
Match the character string “=TES” literally (case sensitive) =TES
Match a single character that is a “digit” (0–9 in any Unicode script) \d+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
Match the character “"” literally "
\1:\2
Insert the text that was last matched by capturing group number 1 \1
Insert the character “:” literally :
Insert the text that was last matched by capturing group number 2 \2
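A short sketch of both ideas on the first list item from the question. Note that re.sub only rewrites the matched part, so the surrounding braces stay in place; to get just TES1:=TES0, the groups can be joined after a search instead. The names subject, replaced, and pair are illustrative.

```python
import re

subject = '{"TES1":"=TES0"}}'
pattern = re.compile(r'"(TES\d+)":"(=TES\d+)"')

# Replace the quoted pair in place; text outside the match is untouched.
replaced = pattern.sub(r"\1:\2", subject)

# Build only the pair from the two capture groups.
match = pattern.search(subject)
pair = f"{match.group(1)}:{match.group(2)}" if match else None

print(replaced)
print(pair)
```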
You can use a single replacement, example:
import re
# Python uses \1 and \2 (not $1 and $2) for backreferences in the replacement
result = re.sub(r'\{"(TES\d)":"(=TES\d)"\}\}', r'\1:\2', yourstr, count=1)
