Regex parse Buffy Script using look behinds - python

I'm having a difficult time parsing this page: http://www.buffyworld.com/buffy/transcripts/114_tran.html
I'm attempting to get the character name with the associated dialogue.
The text looks like this:
<p>BUFFY: Wait!
<p>She stands there panting, watching the truck turn a corner.
<p>BUFFY: (whining) Don't you want your garbage?
<p>She sighs, pouts, turns and walks back toward the house.
<p>Cut to the kitchen. Buffy enters through the back door, holding a pile of
mail. She begins looking through it. We see Dawn standing by the island.
<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
<p>Dawn piles her books into her school bag. Buffy opens a letter.
<p>Close shot of the letter.
<p>
<p>Dawn smiles, and she and Willow exit. Buffy picks up the still-wrapped
sandwich and stares at it.
<p>BUFFY: (to herself) Somebody should.
<p>She sighs, puts the sandwich back in the bag.
<p>Cut to the Bronze. Pan across various people drinking and dancing,
bartender serving. Reveal Xander and Anya sitting at the bar eating chips from
several bags. A notebook sits in front of them bearing the wedding seating
chart.
<p>ANYA: See ... this seating chart makes no sense. We have to do it again.
(Xander nodding) We can't do it again. You do it.<br>XANDER: The seating
chart's fine. Let's get back to the table arrangements. I'm starting to have
dreams of gardenia bouquets. (winces) I am so glad my manly coworkers didn't
just hear me say that. (eating chips)
Ideally, I'd match from <p> or <br> to the next <p> or <br>. I was trying to use look aheads and look behinds for this:
reg = "((?<=<p>)|(?<=<br>))(?P<character>.+):(?P<dialogue>.+)((?=<p>)|(?=<br>))"
script = re.findall(reg, html_text)
Unfortunately, this doesn't match anything. When I leave off the lookahead ((?=<p>)|(?=<br>)), I match lines as long as there isn't a newline in the matching dialogue. It seems to terminate at the newline instead of continuing to the <p>
ex. On this line, the "Thanks" isn't matched. <p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
Thank you for any insight you have!

Work around the dot, which does not match newlines by default:
re.findall('((?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)', text)
You can also try the re.DOTALL flag, which makes the dot match line breaks as well. Personally, when I can, I use splits or an HTML parser. Regex escaping, with all its parameters, limitations and flags, can drive anyone mad. There is also re.split.
dialogs = {}
# Treat <br> like <p> so that every speech starts its own chunk
text = html_text.replace('<br>', '<p>')
paragraphs = text.split('<p>')
for p in paragraphs:
    if ":" in p:
        char, line = p.split(":", 1)
        # setdefault creates the list on first use, so the first line is kept too
        dialogs.setdefault(char, []).append(line)
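As a rough sketch of the flag-based alternative mentioned above: with re.DOTALL the dot also matches newlines, so a lazy match followed by a lookahead can run from one tag to the next. The excerpt in html_text is sample data copied from the question, and the assumption that speaker names are all upper-case letters is taken from the regex above.

```python
import re

# A small excerpt in the same shape as the transcript (sample data for illustration).
html_text = (
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
    "<p>Dawn piles her books into her school bag. Buffy opens a letter.\n"
)

# (?:<p>|<br>) consumes the opening tag, (.+?) lazily takes the dialogue,
# and the lookahead stops at the next tag or at the end of the text.
pattern = re.compile(r"(?:<p>|<br>)([A-Z]+):(.+?)(?=<p>|<br>|\Z)", re.DOTALL)

# Collapse line breaks and runs of spaces inside each piece of dialogue.
script = [(name, " ".join(text.split())) for name, text in pattern.findall(html_text)]
print(script)
```

Stage directions are skipped here only because they do not start with an upper-case word followed by a colon; a real transcript may need a stricter test.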

Related

Using Python to recognize to what character a line is directed towards in Shakespeare’s plays

I'm new to Python, and I’m using Python to extract lines said by certain characters in Shakespeare’s plays. I'm using a .txt file of Romeo and Juliet which essentially works as follows:
Jul. Wilt thou be gone? It is not yet near day.
It was the nightingale, and not the lark,
That pierc'd the fearful hollow of thine ear.
Nightly she sings on yond pomegranate tree.
Believe me, love, it was the nightingale.
Rom. It was the lark, the herald of the morn;
No nightingale. Look, love, what envious streaks
Do lace the severing clouds in yonder East.
Night's candles are burnt out, and jocund day
Stands tiptoe on the misty mountain tops.
I must be gone and live, or stay and die.
Jul. Yond light is not daylight; I know it, I.
It is some meteor that the sun exhales
To be to thee this night a torchbearer
And light thee on the way to Mantua.
Therefore stay yet; thou need'st not to be gone.
Rom. Let me be ta'en, let me be put to death.
I am content, so thou wilt have it so.
I'll say yon grey is not the morning's eye,
'Tis but the pale reflex of Cynthia's brow;
Nor that is not the lark whose notes do beat
The vaulty heaven so high above our heads.
I have more care to stay than will to go.
Come, death, and welcome! Juliet wills it so.
How is't, my soul? Let's talk; it is not day.
Jul. It is, it is! Hie hence, be gone, away!
It is the lark that sings so out of tune,
Straining harsh discords and unpleasing sharps.
Some say the lark makes sweet division;
This doth not so, for she divideth us.
Some say the lark and loathed toad chang'd eyes;
O, now I would they had chang'd voices too,
Since arm from arm that voice doth us affray,
Hunting thee hence with hunt's-up to the day!
O, now be gone! More light and light it grows.
Rom. More light and light- more dark and dark our woes!
The assumption I've made is that a line is directed towards the character that spoke directly before. For example, I assume that the last line of this text (' More light and light- more dark and dark our woes!') is directed towards Juliet (or Jul.).
I'm trying to extract all the lines spoken by Romeo, which are directed towards Juliet, using Regular Expression. This is the code I have so far:
def get_sentences(full_text):
    sentences = sent_tokenize(full_text.strip())
    return sentences
sentences = get_sentences(full_text)
lines = []
for lines in sentences:
    if re.findall("\ARom.",lines):
        print(lines)
However, this only returns a list as follows:
Rom. Rom. Rom. Rom. etc.
I've been trying to figure out what to do for hours, but I can't figure out what my next step should be.
Any help is greatly appreciated!
It looks like the first 'sentence' in lines is always the character's name, so you can split lines on the first period and take the first piece as the name.
You could do that with split():
character = lines.split('.')[0]
You might read all lines at once, and with multiline enabled using re.M write a pattern like:
^Rom\. .*(?:\n(?!(?:Rom|Jul)\. ).*)*
Explanation
^ Start of a line (because of re.M)
Rom\. Match Rom. followed by a space
.* Match the rest of the line
(?: Non-capturing group
\n Match a newline
(?!(?:Rom|Jul)\. ).* Only match the whole line if it does not start with Rom. or Jul.
)* Optionally repeat the non-capturing group to match all following lines
See a regex demo and a Python demo.
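A minimal sketch of how that pattern might be used from Python. The text in full_text is a shortened excerpt of the passage from the question, and the pattern only knows the two speaker prefixes Rom. and Jul., so other speakers would have to be added to the alternation.

```python
import re

# Short excerpt in the same layout as the play text (sample data).
full_text = """Jul. Wilt thou be gone? It is not yet near day.
It was the nightingale, and not the lark,
Rom. It was the lark, the herald of the morn;
No nightingale. Look, love, what envious streaks
Jul. Yond light is not daylight; I know it, I.
Rom. More light and light- more dark and dark our woes!"""

# re.M lets ^ match at the start of every line, not only at the start of the text.
pattern = re.compile(r"^Rom\. .*(?:\n(?!(?:Rom|Jul)\. ).*)*", re.M)

# There are no capturing groups, so findall returns the full text of each speech.
romeo_speeches = pattern.findall(full_text)
for speech in romeo_speeches:
    print(speech)
    print("---")
```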

Regex for multiple lines separated with "return" and multiple unnecessary spaces

I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text.
It immediately shows the outcome, where names are on the same line as the dialogue,
which could allow for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
I'm not sure whether it will work for longer dialogues.
Another solution is to extract data from the text file that you can download by clicking the "download output file" link. That file is formatted differently. In that file,
10 leading spaces indicate the name and 5 leading spaces the dialogue, at least for your sample screenshot.
The regex is
r"          (.+)(\n(     [^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts whatever starts with ten spaces into group 1 and whatever starts with exactly five spaces into group 2:
the subexpression "     [^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but a space, and the rest of the symbols until the end of the line are arbitrary. Since dialogues tend to be multiline, that expression is followed by another plus.
You will have to delete the extra whitespace from the dialogue with additional code and/or regex.
If the number of spaces varies a bit (say 4-6 and 7-14 respectively) but the sections stay distinct, the regex needs to be adjusted with the variable repetition operator (curly braces, {4,6}) or optional spaces ( ?).
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
The last idea is to use a preexisting list of names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
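Here is a small sketch of the indentation-based idea in Python. The sample text is made up and indented the way the regex above expects (a more deeply indented name line followed by less deeply indented dialogue lines); the exact space counts in a real OCR output are an assumption that has to be checked. The pattern is a slight variation that anchors on line starts with re.M and uses non-capturing groups so that findall returns exactly two groups.

```python
import re

# Hypothetical OCR-style text: names indented by 10 spaces, dialogue by 5,
# action lines not indented at all.
script = (
    "          ARTHUR\n"
    "     No. I just,-- some of it's\n"
    "     personal. You know?\n"
    "          SOCIAL WORKER\n"
    "     I understand. I just want to make\n"
    "     sure you're keeping up with it.\n"
    "She slides his journal back to him. He holds it in his lap.\n"
)

# Group 1: a name line with 7-14 leading spaces.
# Group 2: one or more dialogue lines with 4-6 leading spaces each.
pattern = re.compile(r"^ {7,14}(\S.*)\n((?: {4,6}\S.*\n)+)", re.M)

# Strip the name and collapse the extra whitespace inside the dialogue.
rows = [
    (name.strip(), " ".join(dialogue.split()))
    for name, dialogue in pattern.findall(script)
]
print(rows)
```

The unindented action line is left out because it matches neither indentation range.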
In Python, you can use DOTALL:
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (this means this regex won't recognize speaker names with less than three characters. I did this to avoid false positives in special cases but you might wanna change it. Also check what kind of characters might occur in speaker name (dots, minus, lower case...))
(?=\n) is a lookahead ensuring the speaker name is directly followed by a new line (avoids false positives if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a non-capturing expression ensuring the following part is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.

string.punctuation fails to remove certain characters from a string

My aim is to remove all punctuations from a string so that I can then get the frequency of each word in the string.
My string is:
WASHINGTON—Coming to the realization in front of millions of viewers
during the broadcast of his show, a horrified Tucker Carlson stated,
‘I…I am the mainstream media’ Wednesday as he began spiraling live on
air. “We’ve discovered evidence of rampant voter fraud, and the
president has every right to call for an investigation even if the
mainstream media thinks...” said Carlson, who trailed off, stared down
at his shaking hands, and felt a sudden ringing in his ears as he
looked back up and zeroed in on the production crew surrounding him.
“The media says…wait. Those liars on TV will try to tell you…oh God.
We’re the number-one program on cable news, aren’t we? Fox News…Fox
‘News.’ It’s the media. It’s me. This can’t be. No, no, no, no. Jesus
Christ, I make $6 million a year. Get that camera off me!” At press
time, Carlson had torn the microphone from his lapel and fled the set
in panic.
source: https://www.theonion.com/i-i-am-the-mainstream-media-realizes-horrified-tuc-1845646901
I want to remove all punctuations from it. I do that like this -
s.translate(str.maketrans('', '', string.punctuation))
This is the output -
WASHINGTON—Coming to the realization in front of millions of viewers
during the broadcast of his show a horrified Tucker Carlson stated
‘I…I am the mainstream media’ Wednesday as he began spiraling live on
air “We’ve discovered evidence of rampant voter fraud and the
president has every right to call for an investigation even if the
mainstream media thinks” said Carlson who trailed off stared down at
his shaking hands and felt a sudden ringing in his ears as he looked
back up and zeroed in on the production crew surrounding him “The
media says…wait Those liars on TV will try to tell you…oh God We’re
the numberone program on cable news aren’t we Fox News…Fox ‘News’ It’s
the media It’s me This can’t be No no no no Jesus Christ I make 6
million a year Get that camera off me” At press time Carlson had torn
the microphone from his lapel and fled the set in panic
As you can see, characters such as “, — and … still exist. Am I incorrectly expecting them to be removed too? If the output is correct, then how can I avoid differentiating between ‘News’ and News?
>>> import string
>>> "“" in string.punctuation
False
>>> "—" in string.punctuation
False
Welcome to the wonderful world of Unicode where, among many other things, … is not three concatenated full stop periods and :
>>> import unicodedata
>>> unicodedata.name('—')
'EM DASH'
is not a hyphen.
How you want to handle the full scope of what could be considered 'punctuation' across the Unicode table is probably out of scope for this question, but you could either come up with your own ad-hoc list or use a third-party library designed for that type of text manipulation. Here is one such approach:
Best way to strip punctuation from a string
Here is the list of characters that your implementation removes from a string:
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
You can check this implementation to remove all special characters and keep whitespaces
''.join(e for e in s if e.isalnum() or e == ' ')
It looks like the … and a couple of the other characters you are having trouble with are special Unicode characters. A workaround is to use str.isalpha(), which tells you whether a character is a letter. Note that it also drops digits, so use isalnum() instead if you want to keep numbers.
result = ""
for x in s:
    if x.isalpha() or x == " ":
        result = result + x
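Another option, sketched here under a few assumptions, is to ask unicodedata for each character's general category: every category that starts with "P" is some kind of punctuation, which covers the curly quotes, the em dash and the ellipsis. The helper name strip_punctuation and the choice to drop apostrophes while replacing all other punctuation with a space are illustrative decisions, not part of the answers above.

```python
import unicodedata
from collections import Counter

APOSTROPHES = {"'", "\u2019"}  # straight and curly apostrophe


def strip_punctuation(text):
    """Remove Unicode punctuation so that the words can be counted."""
    out = []
    for ch in text:
        if ch in APOSTROPHES:
            continue  # drop it, so a contraction stays a single word
        if unicodedata.category(ch).startswith("P"):
            out.append(" ")  # replace, so words joined by a dash stay separate
        else:
            out.append(ch)
    return "".join(out)


s = "WASHINGTON—Coming… Fox ‘News.’ It’s the media. It’s me."
words = strip_punctuation(s).split()
counts = Counter(words)
print(counts)
```

Currency signs such as $ are in a symbol category rather than a punctuation category, so they would need separate handling.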

How to capture from whitespace +{n} to next {n} in regex

I've cleaned up a document to allow me to properly rip it verse by verse. Being weak in regex I cannot seem to find the right expression to extract these verses.
This is the expression I am using:
(\t?\t?{\d+}.*){
And I'm doing this in python, though I expect that does not matter.
How should I change this to make it simply highlight verses {x} some verse {x} next verse, but stopping short just of the next brace?
As you can see, I'm trying to keep it tabs-aware because this doc gives some attention to verse-style writing.
And here is an example doc:
{1} The words of the blessing of Enoch, wherewith he blessed the elect [[[[and]]]] righteous, who will be living in the day of tribulation, when all the wicked [[[[and godless]]]] are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, [[which]] the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, (even) on Mount Sinai,
[[And appear from His camp]]
And appear in the strength of His might from the heaven of heavens.
{5} And all shall be smitten with fear
And the Watchers shall quake,
And great fear and trembling shall seize them unto the ends of the earth.
{6} And the high mountains shall be shaken,
And the high hills shall be made low,
And shall melt like wax before the flame
{7} And the earth shall be [[wholly]] rent in sunder,
And all that is upon the earth shall perish,
And there shall be a judgement upon all (men).
{8} But with the righteous He will make peace.
And will protect the elect,
And mercy shall be upon them.
And they shall all belong to God,
And they shall be prospered,
And they shall [[all]] be blessed.
[[And He will help them all]],
And light shall appear unto them,
[[And He will make peace with them]].
{9} And behold! He cometh with ten thousands of [[His]] holy ones
To execute judgement upon all,
And to destroy [[all]] the ungodly:
And to convict all flesh
Of all the works [[of their ungodliness]] which they have ungodly committed,
And of all the hard things which ungodly sinners [[have spoken]] against Him.
[BREAK]
[CHAPTER 2]
Simply split the text on the verse markers with re.split:
import re
text = '''{1} The words of the blessing of Enoch, wherewith he blessed the elect [[[[and]]]] righteous, who will be living in the day of tribulation, when all the wicked [[[[and godless]]]] are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, [[which]] the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, (even) on Mount Sinai,
[[And appear from His camp]]
And appear in the strength of His might from the heaven of heavens.'''
result = [i for i in re.split(r'\{\d+\}', text) if i]
result has four elements, corresponding to {1} through {4} above.
(\t?\t?{\d+}.*?)(?={)
See demo.
https://regex101.com/r/OCpDb7/1
Edit:
If you want to capture last verse as well,use
(\t?\t?{\d+}.*?)(?={|\[BREAK\])
See demo.
https://regex101.com/r/OCpDb7/2
Your original regex suffered from 2 problems.
(\t?\t?{\d+}.*){
             ^ ^
1) You used a greedy operator. Use the non-greedy .*? instead.
2) You were consuming the { that follows, so the next verse could not match because its opening brace had already been consumed. Use a lookahead to assert it without consuming it.
The answer above is good, but the verses are not always numbered consecutively in this book (i.e., it can jump from verse 5 to 7 due to manuscript details), so I had to retain the verse numbers in order to pluck them out later. Basically, entire verses had to be extracted along with their numbers.
The recipe seemed to be this:
verse = re.compile(r'(\t*\{\d+\}[^{]*)', re.DOTALL)
In context:
import re
# Read the whole book, then split it into chapters on the [BREAK] markers
with open('thebook.txt', 'r') as fh:
    f = fh.read()
chapters = f.split('[BREAK]')
# \t* keeps any leading tabs; [^{]* runs up to, but not including, the next verse marker
verse = re.compile(r'(\t*\{\d+\}[^{]*)', re.DOTALL)
verses = re.findall(verse, chapters[1])
Please note, it seems to work properly, but I have to check the results to make sure it accounts for everything.
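To make the "pluck the number later" step concrete, here is a hedged sketch that captures the number and the verse body as two groups and stores them in a dictionary. The sample text is a shortened, made-up excerpt with a gap in the numbering; keying by verse number assumes numbers do not repeat within a chapter.

```python
import re

# Shortened sample chapter; note the jump from verse 3 to verse 5.
text = """{1} The words of the blessing of Enoch. {2} And he took up his parable. {3} Concerning the elect I said:
\tThe Holy Great One will come forth from His dwelling,
{5} And all shall be smitten with fear
[BREAK]"""

chapter = text.split("[BREAK]")[0]

# Group 1 is the verse number, group 2 everything up to the next opening brace.
# [^{] also matches newlines, so no flag is needed for multi-line verses.
verse = re.compile(r"\{(\d+)\}([^{]*)")

# strip() trims the edges only, so tabs inside a verse are preserved.
verses = {int(num): body.strip() for num, body in verse.findall(chapter)}
print(sorted(verses))
```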

Using regex to split text content into dictionary

I have a text file that follows this format.
LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting
at the airport. A gunman opens fire at baggage claim in Fort
Lauderdale, witnesses describing scenes of sheer horror. A silent
killer shooting people in the head as they tried to run and hide.
Tonight, a storm of questions. Why did he do it? The suspect, a
passenger with a firearm in his checked bag. New concerns about
airport security before the checkpoint.
(00:00:25): Also breaking tonight the new report from U.S.
intelligence: Vladimir Putin himself ordered the effort to influence
the election, aimed at hurting Clinton and helping Trump win. What the
President-elect is saying after his top-secret briefing.
(00:00:39): And States of Emergency: Millions from coast to coast
paralyzed by a massive winter storm.
(00:00:45): NIGHTLY NEWS begins right now.
I am trying to parse this into a Python dictionary keyed by speaker, where each speaker maps to a dictionary with timecodes as keys and the content as values. I can't split consistently, because there may be information before the timecode (as in the first quote), and because the split character : also appears inside the timecode itself (00:00:00).
Trying to split according to the regex.
for line in msg.get_payload().split('\n'):
    regex = r'\d{2}:\d{2}:\d{2}'
    test = re.split(regex, line)
    print(test)
    sleep(1)
This appears to split properly, but I lose the value I am splitting on (the timecode), which I intend to use as a key. How can I properly split the above content to get the speaker, the timecode as a key, and the content as a value? The speaker may appear again later in the text, in which case the new entries should be added to that speaker's timecodes.
The output format I am targeting is something along the lines of
{speakers:{'Lester Holt': {'00:00:01':content..., '00:00:25': content...},
'speaker2':{etc,etc,etc} }}
I've tried using the split as mentioned above, but it removes my timecode variable.
Any thoughts and guidance are appreciated.
Don't bother with split. You're trying to get 2-3 pieces of information out of each line, so try the following:
for line in msg.get_payload().split('\n'):
    match = re.search(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$', line)
    if match:
        (speaker, time, message) = match.groups()
Speaker will be an empty string if none was present on that line.
Regex explanation:
^ # Start of line
\s* # Drop leading whitespace
([^(]*?) # Capture the speaker if present (non-paren characters)
\s* # Drop whitespace between name and time
\( # Drop literal open paren
(\d{2}:\d{2}:\d{2}) # Capture time
\):\s* # Drop literal close paren, colon and whitespace
(.*) # Capture the rest of the line
$ # End of line
Splitting the message into lines when you need to split it into time-stamped paragraphs is a waste. re.split keeps the tokens it splits on if the pattern contains a capturing group, as the documentation explains. Here's my solution:
toks = re.split(r"\((\d\d:\d\d:\d\d)\):", msg.get_payload())[1:]
answer = dict(zip(toks[::2], toks[1::2]))
This creates a dictionary of timestamps and paragraphs. Feel free to use the same approach to split by speaker as well.
Result:
{
'00:00:01': ' Breaking News Tonight: A .....',
'00:00:25': ' Also breaking tonight ......', ....
}
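Combining both answers, the nested structure from the question could be built roughly as follows. This is only a sketch: the transcript is a shortened sample with a made-up second speaker, and the assumption that speaker names are written in upper case at the start of a line comes from the example, not from any specification of the file format.

```python
import re
from collections import defaultdict

# Shortened sample in the same format; SPEAKER TWO is a hypothetical second speaker.
transcript = """LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting
at the airport.
(00:00:25): Also breaking tonight the new report from U.S.
intelligence.
SPEAKER TWO (00:01:10): A second voice
joins in.
"""

# An entry starts with an optional upper-case speaker name and a "(hh:mm:ss):" stamp.
entry = re.compile(r"^([A-Z][A-Z .'-]*?)?\s*\((\d{2}:\d{2}:\d{2})\):", re.M)

speakers = defaultdict(dict)
current = None  # stays None if the text starts without a speaker name
matches = list(entry.finditer(transcript))
for i, m in enumerate(matches):
    if m.group(1):
        current = m.group(1).title()  # a new speaker takes over
    # The content runs from the end of this stamp to the start of the next entry.
    end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
    speakers[current][m.group(2)] = " ".join(transcript[m.end():end].split())

print(dict(speakers))
```

Entries without a name are attributed to the most recent speaker, and a speaker who returns later simply gets more timecodes added to the same inner dictionary.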
