Capture text groups in paragraph Regex - python

I want to capture headers and their corresponding values from a paragraph.
Example,
Paragraph : "INTRODUCTION: There was a beautiful village. CONCLUSION: End of the story"
Regex used: \b([A-Z]+(?:\s+[A-Z]+)*):\s(.*?)(?=\s\b(?:[A-Z]+(?:\s+[A-Z]+)*):|$)
Output: [('INTRODUCTION', 'There was a beautiful village'), ('CONCLUSION', 'End of the story')]
There is no problem with this. But sometimes I get patterns like,
Paragraph: "There was a beautiful garden once. INTRODUCTION: There was a beautiful village. CONCLUSION: End of the story"
Output Expected: [('FreeText', 'There was a beautiful garden once'), ('INTRODUCTION', 'There was a beautiful village'), ('CONCLUSION', 'End of the story')]
How can I handle the case where some free text comes before the first header? Any help is appreciated.

I don't think you'll be able to use a lookbehind assertion in most cases, because the free text will be of unknown length and Python's re module only supports fixed-width lookbehinds.
You could try matching for the beginning of the text (if the paragraph you've shown us is typically how the string looks).
This will match the cases without free text:
^\b([A-Z]+(?:\s+[A-Z]+)*):\s(.*?)(?=\s\b(?:[A-Z]+(?:\s+[A-Z]+)*):|$)
If you have cases that are in the form:
^\b([A-Z][a-z]+[.])
Then you know you have a case that starts with free text because of the absence of an all capitalized key word. So you could add something like this to the start of your regex and just have two different cases for matching.
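One way to combine the two cases is sketched below, under assumptions: the helper name split_sections is hypothetical, and the 'FreeText' label is taken from the expected output in the question. Instead of a second pattern, it treats everything before the first header match as free text. Note that the captured values keep their trailing period, which the question's expected output omits.

```python
import re

# Header: one or more all-caps words, e.g. "INTRODUCTION" or "CASE HISTORY".
HEADER = r"[A-Z]+(?:\s+[A-Z]+)*"
SECTION = re.compile(rf"\b({HEADER}):\s(.*?)(?=\s\b{HEADER}:|$)")


def split_sections(paragraph):
    """Return (header, value) pairs; leading text without a header is labelled 'FreeText'."""
    pairs = []
    first = SECTION.search(paragraph)
    # Everything before the first header (or the whole text if there is none).
    lead = (paragraph[:first.start()] if first else paragraph).strip()
    if lead:
        pairs.append(("FreeText", lead))
    pairs.extend(SECTION.findall(paragraph))
    return pairs


text = ("There was a beautiful garden once. INTRODUCTION: There was a "
        "beautiful village. CONCLUSION: End of the story")
print(split_sections(text))
```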

Related

Regex for multiple lines separated with "return" and multiple unnecessary spaces

I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text.
It shows the outcome immediately, with names on the same line as the dialogue,
which could allow for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
I'm not sure whether it will work for longer dialogues.
Another solution is to extract data from the text file that you can download by clicking the "download output file" link. That file is formatted differently. In that file,
10 leading spaces indicate the name and 5 leading spaces the dialogue, at least for your sample screenshot.
The regex is
r" {10}(.+)(\n( {5}[^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts whatever starts with ten spaces into group 1 and whatever starts with exactly five spaces into group 2:
The subexpression " {5}[^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but a space, and the rest of the symbols until the end of the line are arbitrary. Since dialogue tends to span multiple lines, that expression is followed by another plus.
You will have to delete extra white space from dialogue with additional code and/or regex.
If the number of spaces varies a bit (say 7-14 for names and 4-6 for dialogue) but the two ranges stay distinct, the regex needs to be adjusted with a bounded repetition operator (curly braces, e.g. {4,6}, written with a comma and no spaces) or optional spaces (?).
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
The last idea is to use a preexisting list of the names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
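The indentation-based idea above could be sketched as follows. The sample text and its exact indentation (10 spaces for names, 5 for dialogue, none for actions) are assumptions made for illustration; the real downloaded file may differ. Here ^ with re.MULTILINE anchors each name line, and \S replaces [^ ] so that tab-indented lines are not mistaken for dialogue.

```python
import re

# Hypothetical sample: names indented 10 spaces, dialogue 5, actions 0.
script = (
    "          ARTHUR\n"
    "     Yeah. I mean, that's just--\n"
    "          SOCIAL WORKER\n"
    "     I understand. I just want to make\n"
    "     sure you're keeping up with it.\n"
    "She slides his journal back to him.\n"
)

# Group 1: the name line; group 2: one or more dialogue lines that follow it.
pattern = re.compile(r"^ {7,14}(\S.*)\n((?: {4,6}\S.*\n)+)", re.MULTILINE)

rows = [
    # Collapse the indentation and line breaks inside the dialogue.
    (name.strip(), " ".join(dialogue.split()))
    for name, dialogue in pattern.findall(script)
]
for speaker_name, line_spoken in rows:
    print(speaker_name, "|", line_spoken)
```

Action lines are skipped because they match neither indentation range.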
In Python, you can use DOTALL:
import re
# mystr holds the text of the script
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (this means this regex won't recognize speaker names with less than three characters. I did this to avoid false positives in special cases but you might wanna change it. Also check what kind of characters might occur in speaker name (dots, minus, lower case...))
(?=\n) is a lookahead ensuring the speaker name is directly followed by a new line (avoids false positive if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a non-capturing expression ensuring the following part is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.
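As a small illustration of that formatting step, the sketch below runs the same pattern on a shortened, hypothetical sample (single newlines between lines) and collapses the line breaks inside each dialogue. It does not remove actions: "He leans in." still ends up attached to the preceding speaker.

```python
import re

# Hypothetical sample with single newlines; the real script text may differ.
mystr = (
    "ARTHUR\n"
    "Yeah. I mean, that's just--\n"
    "SOCIAL WORKER\n"
    "Does my reading it upset you?\n"
    "He leans in.\n"
    "ARTHUR\n"
    "No. I just,-- some of it's\n"
    "personal. You know?\n"
)

re_pattern = re.compile(
    r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL
)

# Join wrapped dialogue lines into one line per speaker turn.
rows = [
    (name.strip(), " ".join(text.split()))
    for name, text in re_pattern.findall(mystr)
]
for row in rows:
    print(row)
```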

Need help in RegEx to grab anything after a mandatory value

I have a text in which I need to grab data and split it up. I need to find "Review frequency" within a large group of text, then once that is found, take everything after it and stop at the ')'.
Example text is:
No. of components Variable
Review frequency Quarterly (Mar., Jun., Sep., Dec.)
Quick facts
To learn more about the
What I need is 'Quarterly' and 'Mar., Jun., Sep., Dec.'
My current regex is:
((?=.*?\bReview frequency\b)(\b(Q|q)uarterly|(A|a)nnually|(S|s)emi-(A|a)nnually))
But this is not working. Essentially the 'Review frequency' needs to be the qualifier before we start picking up the other information, as there may be other dates/periods within the file. Thank you!
You are not matching the rest of the data on the line.
I suggest using:
(?m)^Review frequency[ \t]+(\w+)[ \t]+(.+)
See the regex demo
If the first capturing group can only contain 3 words as indicated in your pattern, use
(?m)^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+(.*)
See another regex demo
Use these patterns with re.findall:
import re
regex = r"(?m)^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+(.*)"
test = "No. of components Variable\nReview frequency Quarterly (Mar., Jun., Sep., Dec.)\nQuick facts\nTo learn more about the"
print(re.findall(regex, test))
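The patterns above capture the months together with the surrounding parentheses. If the parentheses should be left out, as in the output the question asks for, a variant along these lines might work. It is a sketch that assumes the parenthesised list always sits on the same line as "Review frequency"; the pattern and variable names are illustrative.

```python
import re

text = (
    "No. of components Variable\n"
    "Review frequency Quarterly (Mar., Jun., Sep., Dec.)\n"
    "Quick facts\n"
    "To learn more about the"
)

# Group 1: the frequency word (optionally hyphenated, e.g. Semi-Annually).
# Group 2: whatever sits between the parentheses on the same line.
pattern = re.compile(
    r"^Review frequency[ \t]+(\w+(?:-\w+)?)[ \t]+\(([^)\n]*)\)",
    re.MULTILINE,
)

match = pattern.search(text)
if match:
    frequency, months = match.groups()
    print(frequency)
    print(months)
```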

Python regular expression grabbing paragraphs from old HTML

I am working on transferring old content from a website, written in some old HTML, to their new WordPress site. I am using Python to do this. I am trying to:
1. Get the content from the old HTML pages using urllib.request.
2. Use a regular expression to grab the text of HTML <p> elements that have classes that identify them as the body of the text.
3. Use XML-RPC methods to upload the content to the new WordPress site.
I'm ok with #1 and #3. The problem I am having is with #2, writing the regular expression to capture the content.
The content is in paragraphs that have varying format. Below are two representative examples of two paragraphs that I am trying to extract their content using a regular expression.
Paragraph #1
<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;
mso-bidi-font-size:10.0pt'>We have no need to fear the future." So said
bishop-elect H. George Anderson at a news conference immediately following his election as
bishop of the Evangelical Lutheran Church in America. "[The
future] belongs­ to God, untouched by human hands." At the beginning of a
new ministry of leadership and pastoral oversight, such words from a bishop are
obviously designed to project confidence and a profound sense of trust in the
mission of the Church. They are words designed to inspire and empower the
people of God for ministry.<o:p></o:p></span></p>
Paragraph #2
<p class=BODY><span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'>Ages
ago, another prophet of the people stood at his station and peered into the
future. The<span style="mso-spacerun: yes"> </span>prophet Habakkuk poised on
the rampart, scanned the horizon for the approaching enemy he knew was coming.
As he waited, Habakkuk prayed to God asking why God was unresponsive to all
this violence and destruction. In Habakkuk chapter 2 the prophet records God's
answer to his questions about the future. God says to the fearful one, "For
there is still a vision for the appointed time;… If it seems to tarry, wait for
it; it will surely come, it will not delay…the righteous live by faith"
(2:3-4).<o:p></o:p></span></p>
Ideally my regular expression would identify content paragraphs by their class of BODY or bodyDC. Once it has identified a paragraph containing text content, it would ignore all the HTML elements preceding and following the text content, and simply grab the text content.
The regular expression I have so far is still a work in progress:
post_content_re = re.compile(r'<p class=(body\w*)(.*?>)(<.*?>)*([a-z])', re.IGNORECASE)
My explanation for my regular expression parts:
class=(body\w*) should match either BODY or bodyDC, but it doesn't, it only matches BODY, and I don't know why
(.*?>) match the remaining attributes in the paragraph element
(<.*?>)* match 0 or more html elements enclosed in <> after the paragraph element
([a-z]) The content I am trying to get would be after any HTML elements. Right now I'm just testing for one letter, not the full paragraph text, because I'm still testing.
The matches I am getting all look like this:
BODY- but I expected BODY or bodyDC
> - this is the closing > of the p element with class BODY
<span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'> - this is the span element after the P element
A - this is the first letter after the span element
So essentially, my RE is matching paragraphs like Paragraph #2 above, but not like Paragraph #1. I'm not sure why, and I'm stuck.
Thank you for any help.
I would follow a two-step approach to this problem:
1. Collect all the paragraphs of interest.
2. Extract the text from each paragraph.
First
Parse out all the paragraphs that have the desired class.
<p\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=(['"]?)(?:body|bodydc)\1(?:\s|>)(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)*(?=<\/p>)
This regex will do the following:
finds all the paragraph tags of the given class, up to but not including the closing </p>
avoids some odd edge cases such as <span onmouseover=" </p> ">
due to regex limitations, it will not work with nested paragraph tags like <p>outside paragraph<p>inside paragraph</p>more text in the outside</p>
See Live Demo
Second
Extract the raw text from each paragraph
(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)
This regex will do the following:
match both the raw text and tags
place the raw text into capture group 1
avoid difficult edge cases
See Live Demo
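In Python, the two steps might be wired together roughly as follows. This is a simplified sketch, not the full patterns above: it assumes attribute values never contain a ">" character, and the names PARAGRAPH, TAG and extract_paragraphs are made up for illustration.

```python
import re
from html import unescape

# Shortened, hypothetical input modelled on the paragraphs in the question.
html_doc = (
    "<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;\n"
    "mso-bidi-font-size:10.0pt'>We have no need to fear the future.<o:p></o:p></span></p>\n"
    "<p class=footer>Skip me</p>\n"
    "<p class=BODY><span style='font-size:14.0pt'>Ages\n"
    "ago, another prophet<span style=\"mso-spacerun: yes\"> </span>stood.</span></p>\n"
)

# Step 1: paragraphs whose class is body or bodyDC (case-insensitive).
# DOTALL lets .*? run across the line breaks inside a paragraph.
PARAGRAPH = re.compile(
    r"<p\s[^>]*?class=['\"]?(?:bodydc|body)['\"]?(?=[\s>])[^>]*>(.*?)</p>",
    re.IGNORECASE | re.DOTALL,
)
# Step 2: any remaining tag inside the paragraph.
TAG = re.compile(r"<[^>]+>")


def extract_paragraphs(html):
    """Return the plain text of each matching paragraph."""
    texts = []
    for inner in PARAGRAPH.findall(html):
        text = TAG.sub("", inner)
        # Decode entities and collapse runs of whitespace and newlines.
        texts.append(" ".join(unescape(text).split()))
    return texts


print(extract_paragraphs(html_doc))
```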
While (as someone commented) you should not parse HTML like this, for this one-off job this kind of solution might just work.
Your regex is not working for the first paragraph because . does not match newlines, and you have a newline inside your tag. You can use tricks like [\S\s] to match all characters, including newlines.
This one does not remove the tags at the end of the paragraph, but I hope it still helps:
for g1, g2, content in re.findall("<p (class=bodyDC|class=BODY)[^><]*>(<[\S\s]*?>)*([\S\s]*?)<\\/p>", str1):
    print(content)
Bit of explanation:
<p (class=bodyDC|class=BODY)[^><]*> matches the opening paragraph tag
<p: the beginning of the tag
(class=bodyDC|class=BODY): one of the two class attributes
[^><]*: any other attributes inside the tag
>: the end of the tag
(<[\S\s]*?>)* matches any number of tags
<: the beginning of the tag
[\S\s]*?: any other attributes (could have also used [^><]*)
>: end of tag
([\S\s]*?) matches any text. This is group 3, this is basically the content. (Plus the tags at the end of it.)
<\/p> matches the closing paragraph tag. (Note that in the code it actually appears as <\\/p>, because the backslash has to be escaped in the python string.)
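The newline point made in this answer can be checked with a short experiment. The snippet below is a sketch using the question's pattern and a shortened copy of Paragraph #1; the variable names are illustrative. Without re.DOTALL the pattern cannot get past the line break inside the <span> tag, while with it the match succeeds.

```python
import re

# Opening of Paragraph #1: the <span> tag contains a newline.
fragment = (
    "<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;\n"
    "mso-bidi-font-size:10.0pt'>We have no need"
)

pattern = r"<p class=(body\w*)(.*?>)(<.*?>)*([a-z])"

# By default "." stops at the newline, so the span tag can never be consumed.
without_dotall = re.search(pattern, fragment, re.IGNORECASE)
# With DOTALL "." also matches "\n" (same effect as writing [\S\s]).
with_dotall = re.search(pattern, fragment, re.IGNORECASE | re.DOTALL)

print(without_dotall)
print(with_dotall.group(1), with_dotall.group(4))
```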

Regular Expressions: Find Names in String using Python

I have never had a very hard time with regular expressions up until now. I am hoping the solution is not obvious because I have probably spent a few hours on this problem.
This is my string:
<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'
I want to extract 'Soko' and 'Jacob Escobedo' as individual strings. If it takes two different patterns for the extractions, that is okay with me.
I have tried "\s([A-Za-z0-9]{1}.+?)," and other alterations of that regex to get the data I want but I have had no success. Any help is appreciated.
The names never follow the same tag or the same symbol. The only thing that consistently precedes the names is a space (\s).
Here is another string as an example:
<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>
An alternative approach would be to parse the string with an HTML parser, like lxml.
For example, you can use the xpath to find everything between a b tag with Carson Daly text and br tag by checking preceding and following siblings:
from lxml.html import fromstring

l = [
    """<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]
for html in l:
    tree = fromstring(html)
    results = ''
    # select every node between the <b>Carson Daly</b> tag and the <br> tag
    for element in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
        if not isinstance(element, str):
            results += element.text.strip()
        else:
            text = element.strip(':')
            if text:
                results += text.strip()
    print(results.split(', '))
It prints:
['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
If you want to do it in regex (and with all the disclaimers on that topic), the following regex works with your strings. However, do note that you need to retrieve your matches from capture Group 1. In the online demo, make sure you look at the Group 1 captures in the bottom right pane. :)
<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))
Basically, with the left alternations (separated by |) we match everything we don't want, then the final parentheses on the right capture what we do want.
This is an application of this question about matching a pattern except in certain situations (read that for implementation details including links to Python code).
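A minimal sketch of how the Group 1 captures could be collected in Python; the function name extract_names is hypothetical. Matches produced by the left-hand alternatives leave Group 1 empty and are simply skipped. Note that this also returns the first name on each line, not only the ones after a comma.

```python
import re

PATTERN = re.compile(
    r"<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))"
)


def extract_names(line):
    """Keep only the matches where capture group 1 took part."""
    return [m.group(1) for m in PATTERN.finditer(line) if m.group(1)]


lines = [
    "<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'",
    "<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>",
]
for line in lines:
    print(extract_names(line))
```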

Searching for one of two complexish regex patterns in Python without creating submatches

I'm parsing some TV episodes that have been transcribed by different people, meaning I need to search for a variety of formats. For example, new scenes are indicated one of two ways:
[A coffee shop]
or
INT. Coffee shop - NIGHT
Right now, I match this with the following regex in Python:
re.findall("(^\[(.+?)\]$)|(^[INTEXT]{3}\. .+?$)", text)
where "text" is the text of the entire script (hence using findall). This always appears on its own line, hence the ^$
This gives me something like: (None, None, "INT. Coffee Shop - NIGHT") for example.
My question: How do you construct a regex to search for one of two complex patterns, using the | notation, without also creating submatches that you don't really want? Or is there a better way?
Many thanks.
UPDATE: I had overlooked the idea of non-capturing groups. I can accomplish what I want with:
"(?:^\[.+?\]$)|(?:^[INTEX]{3}\. .+?$)"
However, this raises a new question. I don't actually want the brackets or the INT/EXT in the scenes, just the location. I thought that I could use actual groups within the non-capturing groups, but I'm still getting those blank matches for the other expression, like so:
import re
pattern = r"(?:^\[(.+?)\]$)|(?:^[INTEX]{3}\. (.+?)$)"
examples = [
    "[coffee shop]",
    "INT. COFFEE SHOP - DAY",
    "EXT. FIELD - NIGHT",
    "[Hugh's aparment]"
]
for example in examples:
    print(re.findall(pattern, example))
'''
[('coffee shop', '')]
[('', 'COFFEE SHOP - DAY')]
[('', 'FIELD - NIGHT')]
[("Hugh's aparment", '')]
'''
I can just join() them, but is there a better way?
Based on the limited examples you've provided, how about using assertions for the brackets:
re.findall(r"((?<=^\[)[^[\]]+(?=\]$)|^[INTEXT]{3}\. .+?$)", text)
You may be better off just using two expressions.
patterns = [r'^\[(.+?)\]$', r'^(?:INT|EXT)\. (.+?)$']
for example in examples:
    print(re.findall(patterns[0], example) or re.findall(patterns[1], example))
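If a single pattern is preferred, another option (a sketch, not something proposed in this thread) is to keep both capturing groups and read whichever one took part in the match through the match object's lastindex attribute. The names SCENE and scene_locations are illustrative.

```python
import re

# One alternative per scene-marker style; each has its own capturing group.
SCENE = re.compile(r"^\[(.+?)\]$|^(?:INT|EXT)\. (.+?)$", re.MULTILINE)


def scene_locations(text):
    """Return the location from whichever alternative matched."""
    # lastindex is the number of the last group that participated: 1 or 2 here.
    return [m.group(m.lastindex) for m in SCENE.finditer(text)]


examples = [
    "[coffee shop]",
    "INT. COFFEE SHOP - DAY",
    "EXT. FIELD - NIGHT",
    "[Hugh's aparment]",
]
print(scene_locations("\n".join(examples)))
```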
This seems to do what you want:
(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))(?:\[\1\]|[INTEX]{3}\. \1)$
First the lookahead peeks at the text of the scene marker, capturing it in group #1. Then the rest of the regex goes ahead and consumes the whole line containing the marker. Although now I think about it, you don't really have to consume anything. This works, too:
result = re.findall(r"(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))", subject)
The marker text is still captured in group #1, so it still gets added to the result of findall(). Then again, I don't see why you would want to use findall() here. If you're trying to normalize the scene markers by replacing them in place, you'll have to use the consuming version of the regex.
Also, notice the (?m). In your examples you always apply the regex to the scene markers in isolation. To pluck them out of the whole script, you have to set the MULTILINE flag, turning ^ and $ into line anchors.
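To illustrate the MULTILINE point, here is the lookahead-only version applied to a whole script rather than to isolated markers. The script text is a made-up sample for illustration.

```python
import re

# Hypothetical excerpt mixing both scene-marker styles with dialogue lines.
subject = (
    "[A coffee shop]\n"
    "JOEY: Hi.\n"
    "INT. Coffee shop - NIGHT\n"
    "ROSS: Hello.\n"
    "EXT. FIELD - NIGHT\n"
)

# (?m) makes ^ match at the start of every line; the lookahead captures the
# marker text in group 1 without consuming anything.
result = re.findall(r"(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))", subject)
print(result)
```

Since the pattern has exactly one capturing group, findall() returns a flat list of strings with no empty placeholders.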
