How to make the following substitution in a string using regex in Python?

I'm trying to make a substitution in the following string:
poem='''
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
'''
The requirements for the given string are as follows:
If the pattern has characters 'ai' or 'hi', replace the next three characters with *\*.
If a word has 'ch' or 'co', replace it with 'Ch' or 'Co'.
I tried the following methods:
print(re.sub(r"ai\w{3}|hi\w{3}",r"(ai|hi)*\*",poem))
Output:
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one f(ai|hi)*\*ng robin
Unto his nest again,
I shall not live in vain.
print(re.sub(r"ch|co",r"Ch|Co",poem))
Output:
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aCh|Coing,
Or Ch|Cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
You can see the output is not as per the requirements. Please help me find the correct regex expression.

The first you can achieve by referencing a captured group from the pattern in the replacement:
poem = re.sub(r"(ai|hi)\w{3}", r"\g<1>*\*", poem)
For the second, you can pass a function as replacement (see the re.sub docs):
def title(match):
    return match.group(0).title() # or .capitalize()
poem = re.sub(r"ch|co", title, poem)

import re
poem = re.sub(r'(ai|hi)(...)', r'\1*\*', poem)
poem = re.sub('ch', 'Ch', poem)
poem = re.sub('co', 'Co', poem)
print(poem)
This outputs:
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aChi*\*
Or Cool one pain,
Or help one fai*\*ng robin
Unto hi*\*est again,
I shall not live in vain.

You can do the replacements stepwise:
poem='''
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
'''
import re
p2 = re.sub("(?:ai|hi)...","*/*",poem)
p3 = re.sub("ch","Ch",p2)
p4 = re.sub("co","Co",p3)
print(p4)
Output:
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the ac*/*
Or Cool one pain,
Or help one f*/*ng robin
Unto */*est again,
I shall not live in vain.
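Note that the non-capturing group around ai|hi only groups the alternatives. It is still part of the match, so ai and hi are replaced together with the three characters that follow. To keep them, capture the group and refer to it in the replacement. If the match should also be allowed to run across line ends, you might want to change the calls to: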
p = re.sub("ai...","*/*",poem, flags = re.DOTALL)
p2 = re.sub("hi...","*/*",p, flags= re.DOTALL)
p3 = re.sub("ch","Ch",p2)
p4 = re.sub("co","Co",p3)
print(p4)
Output:
If I can stop one heart from breaking,
I shall not live in v*/*If I can ease one life the ac*/*
Or Cool one p*/*Or help one f*/*ng robin
Unto */*est ag*/*I shall not live in v*/*
The flag re.DOTALL lets . also match newline characters.
Without it, vain; would not be matched.
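As a minimal sketch of the difference: a non-capturing group is consumed by the match, while a capturing group (put back in the replacement) or a fixed-width lookbehind keeps ai/hi in the output. The sample line and the names line and MASK are made up for illustration, and the replacement is the question's *\* marker rather than the */* used above. A function is used as the replacement so that the backslash is not processed as a template escape.

```python
import re

# Hypothetical one-line sample; MASK is the question's literal *\* marker.
line = "Or help one fainting robin unto his nest"
MASK = "*\\*"

# Non-capturing group: "ai"/"hi" are part of the match, so they vanish too.
consumed = re.sub(r"(?:ai|hi).{3}", lambda m: MASK, line)

# Capturing group: put the captured prefix back in front of the marker.
captured = re.sub(r"(ai|hi).{3}", lambda m: m.group(1) + MASK, line)

# Lookbehind: only the three following characters are matched at all.
lookbehind = re.sub(r"(?<=ai|hi).{3}", lambda m: MASK, line)

print(consumed)
print(captured)
print(lookbehind)
```

The lookbehind form works here because both alternatives have the same width; Python's re module rejects lookbehinds whose alternatives differ in length.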

print(re.sub(r"co",r"Co",re.sub(r"ch",r"Ch",s)))
This works:
Input:
s='''It takes strength for being certain,
It takes courage to have doubt.
It takes strength for challenging alone,
It takes courage to lean on another.
It takes strength for loving other souls,
It takes courage to be loved.
It takes strength for hiding our own pain,
It takes courage to help if it is paining for someone.'''
Output:
It takes strength for being certain,
It takes Courage to have doubt.
It takes strength for Challenging alone,
It takes Courage to lean on another.
It takes strength for loving other souls,
It takes Courage to be loved.
It takes strength for hiding our own pain,
It takes Courage to help if it is paining for someone.

Here's an answer to your question:
import re
poem='''
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
'''
p1=poem
print(re.sub(r"\n","",poem))
poem=re.sub(r"co","Co",poem)
poem=re.sub(r"ch","Ch",poem)
print(poem)
# keep ai/hi (group 1) and replace the next three characters
print(re.sub(r"(ai|hi).{3}",r"\1*\*",p1))

You can use |, which acts as "or", to list alternatives, and create groups with () so that you can keep some of them by writing \<group number> (1-indexed) in the replacement string.
For the first one, make two groups, (hi|ai) for the prefix and (...) for the next three characters, then keep the first group with \1 and replace only the second:
print(re.sub(r'(hi|ai)(...)', r'\1*\*', poem))
For the second one, make two groups, (c) and (h|o), and keep the second group with \2:
print(re.sub(r'(c)(h|o)', r'C\2', poem))

Related

Using Python to recognize to what character a line is directed towards in Shakespeare’s plays

I'm new to Python, and I’m using Python to extract lines said by certain characters in Shakespeare’s plays. I'm using a .txt file of Romeo and Juliet which essentially works as follows:
Jul. Wilt thou be gone? It is not yet near day.
It was the nightingale, and not the lark,
That pierc'd the fearful hollow of thine ear.
Nightly she sings on yond pomegranate tree.
Believe me, love, it was the nightingale.
Rom. It was the lark, the herald of the morn;
No nightingale. Look, love, what envious streaks
Do lace the severing clouds in yonder East.
Night's candles are burnt out, and jocund day
Stands tiptoe on the misty mountain tops.
I must be gone and live, or stay and die.
Jul. Yond light is not daylight; I know it, I.
It is some meteor that the sun exhales
To be to thee this night a torchbearer
And light thee on the way to Mantua.
Therefore stay yet; thou need'st not to be gone.
Rom. Let me be ta'en, let me be put to death.
I am content, so thou wilt have it so.
I'll say yon grey is not the morning's eye,
'Tis but the pale reflex of Cynthia's brow;
Nor that is not the lark whose notes do beat
The vaulty heaven so high above our heads.
I have more care to stay than will to go.
Come, death, and welcome! Juliet wills it so.
How is't, my soul? Let's talk; it is not day.
Jul. It is, it is! Hie hence, be gone, away!
It is the lark that sings so out of tune,
Straining harsh discords and unpleasing sharps.
Some say the lark makes sweet division;
This doth not so, for she divideth us.
Some say the lark and loathed toad chang'd eyes;
O, now I would they had chang'd voices too,
Since arm from arm that voice doth us affray,
Hunting thee hence with hunt's-up to the day!
O, now be gone! More light and light it grows.
Rom. More light and light- more dark and dark our woes!
The assumption I've made is that a line is directed towards the character that spoke directly before. For example, I assume that the last line of this text (' More light and light- more dark and dark our woes!') is directed towards Juliet (or Jul.).
I'm trying to extract all the lines spoken by Romeo, which are directed towards Juliet, using Regular Expression. This is the code I have so far:
def get_sentences(full_text):
    sentences = sent_tokenize(full_text.strip())
    return sentences
sentences = get_sentences(full_text)
lines = []
for lines in sentences:
    if re.findall("\ARom.",lines):
        print(lines)
However, this only returns a list as follows:
Rom. Rom. Rom. Rom. etc.
I've been trying to figure out what to do for hours, but I can't figure out what my next step should be.
Any help is greatly appreciated!
It looks like the pattern is that the first 'sentence' in lines is the character's name. So maybe you can split lines on the first period and take the first sentence as the name.
You could do that by using split() like:
character = lines.split('.')[0]
You might read all lines at once, and with multiline enabled using re.M write a pattern like:
^Rom\. .*(?:\n(?!(?:Rom|Jul)\. ).*)*
Explanation
^ Start of a line (because re.M is enabled)
Rom\. Match Rom. followed by a space
.* Match the rest of the line
(?: Non capture group
\n Match a newline
(?!(?:Rom|Jul)\. ).* Only match the whole line if it does not start with Rom. or Jul.
)* Optionally repeat the non capture group to match all lines
See a regex demo and a Python demo.
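For illustration, the pattern can be tried on a shortened excerpt of the play. This is only a sketch: the variable names text, pattern and speeches are made up here, and the excerpt is cut down from the question's sample.

```python
import re

# Shortened excerpt from the question's sample text.
text = """Jul. Wilt thou be gone? It is not yet near day.
Believe me, love, it was the nightingale.
Rom. It was the lark, the herald of the morn;
No nightingale. Look, love, what envious streaks
Jul. Yond light is not daylight; I know it, I.
Rom. Let me be ta'en, let me be put to death."""

# re.M makes ^ match at the start of every line, not only at the start of the string.
pattern = re.compile(r"^Rom\. .*(?:\n(?!(?:Rom|Jul)\. ).*)*", re.M)

# Each match is one complete speech by Romeo, continuation lines included.
speeches = pattern.findall(text)
for speech in speeches:
    print(speech)
    print("---")
```

Because the group inside the pattern is non-capturing, findall returns the full matched speeches rather than group contents.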

Regex for multiple lines separated with "return" and multiple unnecessary spaces

I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text.
It shows the outcome immediately, with the names on the same line as the dialogs,
which could allow for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
I am not sure whether it will work for longer dialogs.
Another solution is to extract the data from the text file that you can download by clicking the "download output file" link. That file is formatted differently. In that file,
10 leading spaces indicate the name, and 5 leading spaces the dialog - at least for your sample screenshot.
The regex is
r" {10}(.+)(\n( {5}[^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts whatever starts with ten spaces into group 1, and whatever starts with exactly five spaces into the second group.
The subexpression " {5}[^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but a space, and the rest of the symbols until the end of the line are arbitrary. Since dialogs tend to be multiline, that expression is followed by another plus.
You will have to delete extra white space from the dialogue with additional code and/or regex.
If the number of spaces varies a bit (say 4-6 and 7-14 respectively) but the two sections stay distinct, the regex needs to be adjusted with the variable repetition operator (curly braces, e.g. {4,6}, with no space after the comma) or optional spaces ( ?).
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
The last idea is to use a preexisting list of names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
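The indentation idea could be sketched as follows. Everything here is an assumption for illustration: the sample is built synthetically with 10 spaces before names and 5 before dialogue, parse_script and PATTERN are hypothetical names, and the pattern is a variant of the one above that anchors on line starts with re.M and uses \S for the first non-space character.

```python
import re

# Synthetic sample: names indented by 10 spaces, dialogue by 5 (assumed layout).
script = (
    " " * 10 + "ARTHUR\n"
    + " " * 5 + "Yeah. I mean, that's just--\n"
    + " " * 10 + "SOCIAL WORKER\n"
    + " " * 5 + "Does my reading it upset you?\n"
    + "He leans in.\n"
    + " " * 10 + "ARTHUR\n"
    + " " * 5 + "No. I just,-- some of it's\n"
    + " " * 5 + "personal. You know?\n"
)

# Group 1: a name line with 7-14 leading spaces.
# Group 2: one or more following lines with 4-6 leading spaces.
PATTERN = re.compile(r"^ {7,14}(\S.*)\n((?: {4,6}\S.*\n?)+)", re.M)

def parse_script(text):
    """Return (speaker, dialogue) pairs, collapsing dialogue whitespace."""
    return [(name.strip(), " ".join(block.split()))
            for name, block in PATTERN.findall(text)]

for speaker, dialogue in parse_script(script):
    print(f"{speaker}: {dialogue}")
```

Action lines such as "He leans in." start at column 0 in this sample, so they match neither indentation range and are skipped. A real export may indent them differently, in which case the ranges would need adjusting.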
In Python, you can use DOTALL:
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (This means the regex won't recognize speaker names with fewer than three characters. I did this to avoid false positives in special cases, but you might want to change it. Also check what kind of characters might occur in a speaker name: dots, hyphens, lower case letters...)
(?=\n) is a lookahead ensuring the speaker name is directly followed by a newline (avoids a false positive if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a zero-width assertion, ensuring that what follows is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.

Change string inside curly brackets (option separated by |)

I'm trying to change the text between the curly brackets from the following string:
s = "As soon as {female_character:Aurelia|Aurelius} turned around the corner, {female_character:she|he} remembered that it was the wrong way and would eventually end in a cul-de-sac. Spinning around, {female_character:Aurelia|Aurelius} tried to run back out, but the way was already blocked by the vendor. In this dark alley way, nobody would see or care what happened to some poor beggar turned thief. Should {female_character:Aurelia|Aurelius} put up a fight in hopes of lasting long enough to escape or give up now and trust to the mercy of the vendor?"
My question is this, how do I apply Python logic to the text in that string so that {female_character:Aurelia|Aurelius} in the text applies the logic of:
if (whatever is on the left side of the colon) == True:
(replace {female_character:Aurelia|Aurelius} with the option on the left side of the |)
else:
(replace {female_character:Aurelia|Aurelius} with the option on the right side of the |)
A couple of other points to note: the string is getting pulled from a json file, and there will be many similar texts. Additionally, some of the braces will have braces within braces, like so: {strong_character:is big for his age|{small_character:although small for his age, is a very quick warrior|although average size, is a skilled warrior}}
As I'm sure anyone can tell, I'm still new to coding and am trying to learn Python. So I apologize in advance for any ignorance on my part.
You can use a regular expression to locate the variables and their text replacements. Regular expressions support grouping, so you can grab both True and False in separate groups, and then, depending on the current value of the found variable, replace the entire match with the correct group.
With nested expressions it gets a bit harder, though. Best is to construct the regex in such way that it will not match the outer level of nesting. The first time around, the inner braced expressions will be replaced by plain text, and then a second loop will match and change the rest.
So it may take more than one replacement loop, but how many then? That depends on the number of nesting braces. You could set the loop to a 'surely large enough' number such as 10, but this has several disadvantages. For instance, you need to be sure you don't accidentally nest more than 10 times; and if you have a sentence with only one level of braces and no nesting, it will still loop 9 times more, doing nothing at all.
One way to counter this is by counting the number of nested braces. I think my findall regex does this correctly, but I could be wrong there.
import re
def replaceVars(vars,text):
    for loop in range(len(re.findall(r'\{[^{}]*(?=\{)', text))+1):
        for var in vars:
            if vars[var]:
                text = re.sub (r'\{'+var+r':([^|{}]+)\|([^|{}]+?)\}', r'\1', text)
            else:
                text = re.sub (r'\{'+var+r':([^|{}]+)\|([^|{}]+?)\}', r'\2', text)
    return text
s = "As soon as {female_character:Aurelia|Aurelius} turned around the corner, {female_character:she|he} remembered that it was the wrong way and would eventually end in a cul-de-sac. Spinning around, {female_character:Aurelia|Aurelius} tried to run back out, but the way was already blocked by the vendor. In this dark alley way, nobody would see or care what happened to some poor beggar turned thief. Should {female_character:Aurelia|Aurelius} put up a fight in hopes of lasting long enough to escape or give up now and trust to the mercy of the vendor? Puppy {strong_character:is big for his age|{small_character:although small for his age, is a very quick warrior|although average size, {female_character:she|he} is a skilled warrior}}"
variables = {"female_character":True, "strong_character":False, "small_character":False}
t = replaceVars(variables,s)
print (t)
results in
As soon as Aurelia turned around the corner, she remembered that it was the wrong way and would eventually end in a cul-de-sac. Spinning around, Aurelia tried to run back out, but the way was already blocked by the vendor. In this dark alley way, nobody would see or care what happened to some poor beggar turned thief. Should Aurelia put up a fight in hopes of lasting long enough to escape or give up now and trust to the mercy of the vendor? Puppy although average size, she is a skilled warrior
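An alternative to counting the nesting depth up front, sketched here under my own assumptions, is to keep substituting the innermost braces until a pass changes nothing. The names resolve, flags and INNERMOST are hypothetical, and a variable that is missing from flags would raise a KeyError in this version.

```python
import re

# Innermost {name:left|right}: neither option may contain braces or a pipe.
INNERMOST = re.compile(r"\{(\w+):([^{}|]*)\|([^{}|]*)\}")

def resolve(text, flags):
    """Replace {name:left|right} by left if flags[name] is true, else right."""
    def pick(match):
        name, left, right = match.groups()
        return left if flags[name] else right

    while True:
        # subn returns the new text and how many replacements were made.
        text, count = INNERMOST.subn(pick, text)
        if count == 0:  # nothing left to replace, so all nesting is resolved
            return text

s = ("Puppy {strong_character:is big for his age|{small_character:although "
     "small for his age, is a very quick warrior|although average size, "
     "{female_character:she|he} is a skilled warrior}}")
flags = {"female_character": True, "strong_character": False, "small_character": False}
print(resolve(s, flags))
```

Each pass removes one level of nesting, so the loop runs once per level plus one final pass that finds nothing, and a string without nesting costs only two passes.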

Regex parse Buffy Script using look behinds

I'm having a difficult time parsing this page: http://www.buffyworld.com/buffy/transcripts/114_tran.html
I'm attempting to get the character name with the associated dialogue.
The text looks like this:
<p>BUFFY: Wait!
<p>She stands there panting, watching the truck turn a corner.
<p>BUFFY: (whining) Don't you want your garbage?
<p>She sighs, pouts, turns and walks back toward the house.
<p>Cut to the kitchen. Buffy enters through the back door, holding a pile of
mail. She begins looking through it. We see Dawn standing by the island.
<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
<p>Dawn piles her books into her school bag. Buffy opens a letter.
<p>Close shot of the letter.
<p>
<p>Dawn smiles, and she and Willow exit. Buffy picks up the still-wrapped
sandwich and stares at it.
<p>BUFFY: (to herself) Somebody should.
<p>She sighs, puts the sandwich back in the bag.
<p>Cut to the Bronze. Pan across various people drinking and dancing,
bartender serving. Reveal Xander and Anya sitting at the bar eating chips from
several bags. A notebook sits in front of them bearing the wedding seating
chart.
<p>ANYA: See ... this seating chart makes no sense. We have to do it again.
(Xander nodding) We can't do it again. You do it.<br>XANDER: The seating
chart's fine. Let's get back to the table arrangements. I'm starting to have
dreams of gardenia bouquets. (winces) I am so glad my manly coworkers didn't
just hear me say that. (eating chips)
Ideally, I'd match from <p> or <br> to the next <p> or <br>. I was trying to use look aheads and look behinds for this:
reg = "((?<=<p>)|(?<=<br>))(?P<character>.+):(?P<dialogue>.+)((?=<p>)|(?=<br>))"
script = re.findall(reg, html_text)
Unfortunately, this doesn't match anything. When I leave off the lookahead ((?=<p>)|(?=<br>)), I match lines as long as there isn't a newline in the matching dialogue. It seems to terminate at the newline instead of continuing to the <p>
ex. On this line, the "Thanks" isn't matched. <p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
Thank you for any insight you have!
Work around the dot notation:
re.findall('((?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)', text)
You can also pass the re.DOTALL flag so that the dot matches line breaks too. Personally, when I can, I use splits or an HTML parser. Regex escaping, with all its parameters, limitations and flags, can drive anyone mad. There is also re.split.
dialogs = {}
text = html_text.replace('<br>', '<p>')
paragraphs = text.split('<p>')
for p in paragraphs:
    if ":" in p:
        char, line = p.split(":", 1)
        # setdefault creates the list the first time a character appears,
        # so that first line is stored as well
        dialogs.setdefault(char, []).append(line)
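A regex variant of the same idea can be sketched like this. Instead of lookbehinds it consumes the tag with a non-capturing group, so findall returns only the name and the dialogue; [^<]+ runs across newlines until the next tag. The names PATTERN and script are made up, and the sample is a short excerpt from the question.

```python
import re

# Short excerpt from the question, with a line break inside BUFFY's line.
html_text = (
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
    "<p>Dawn piles her books into her school bag. Buffy opens a letter.\n"
)

# Tag, then an all-caps name, a colon, and everything up to the next "<".
PATTERN = re.compile(r"(?:<p>|<br>)([A-Z]+):\s*([^<]+)")

# Collapse the line breaks and extra spaces inside each piece of dialogue.
script = [(name, " ".join(line.split())) for name, line in PATTERN.findall(html_text)]
for name, line in script:
    print(name, "->", line)
```

Stage directions such as "Dawn piles her books..." are skipped because the name must be entirely upper case and followed directly by a colon.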

How to capture from whitespace +{n} to next {n} in regex

I've cleaned up a document to allow me to properly rip it verse by verse. Being weak in regex I cannot seem to find the right expression to extract these verses.
This is the expression I am using:
(\t?\t?{\d+}.*){
And I'm doing this in python, though I expect that does not matter.
How should I change this to make it simply highlight verses {x} some verse {x} next verse, but stopping short just of the next brace?
As you can see, I'm trying to keep it tabs-aware because this doc gives some attention to verse-style writing.
And here is an example doc:
{1} The words of the blessing of Enoch, wherewith he blessed the elect [[[[and]]]] righteous, who will be living in the day of tribulation, when all the wicked [[[[and godless]]]] are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, [[which]] the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, (even) on Mount Sinai,
[[And appear from His camp]]
And appear in the strength of His might from the heaven of heavens.
{5} And all shall be smitten with fear
And the Watchers shall quake,
And great fear and trembling shall seize them unto the ends of the earth.
{6} And the high mountains shall be shaken,
And the high hills shall be made low,
And shall melt like wax before the flame
{7} And the earth shall be [[wholly]] rent in sunder,
And all that is upon the earth shall perish,
And there shall be a judgement upon all (men).
{8} But with the righteous He will make peace.
And will protect the elect,
And mercy shall be upon them.
And they shall all belong to God,
And they shall be prospered,
And they shall [[all]] be blessed.
[[And He will help them all]],
And light shall appear unto them,
[[And He will make peace with them]].
{9} And behold! He cometh with ten thousands of [[His]] holy ones
To execute judgement upon all,
And to destroy [[all]] the ungodly:
And to convict all flesh
Of all the works [[of their ungodliness]] which they have ungodly committed,
And of all the hard things which ungodly sinners [[have spoken]] against Him.
[BREAK]
[CHAPTER 2]
Simply split the text on the verse markers with re.split:
import re
text = '''{1} The words of the blessing of Enoch, wherewith he blessed the elect [[[[and]]]] righteous, who will be living in the day of tribulation, when all the wicked [[[[and godless]]]] are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, [[which]] the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, (even) on Mount Sinai,
[[And appear from His camp]]
And appear in the strength of His might from the heaven of heavens.'''
result = [i for i in re.split(r'\{\d+\}', text) if i]
result has four elements, corresponding to {1} through {4} above.
(\t?\t?{\d+}.*?)(?={)
See demo.
https://regex101.com/r/OCpDb7/1
Edit:
If you want to capture last verse as well,use
(\t?\t?{\d+}.*?)(?={|\[BREAK\])
See demo.
https://regex101.com/r/OCpDb7/2
Your original regex suffered from two problems.
(\t?\t?{\d+}.*){
             ^ ^
1) You used the greedy operator .* here; use the non-greedy .*? instead.
2) You were consuming the { of the next verse, so that verse could not match because its opening brace had already been consumed. Use a lookahead to assert the brace without consuming it.
The answer above is good, but the verses are not always numbered consecutively in this book (i.e., it can jump from verse 5 to 7 due to manuscript details), so I had to retain the verse markers in order to "pluck the number" from them later. Basically, entire verses along with their numbers had to be extracted.
The recipe seemed to be this:
verse = re.compile(r'(\t*{\d+}[^{]*)', re.DOTALL)
In context:
import re
# \t* allows any number of leading tabs ([\t+]? matched a single tab or a literal +)
with open('thebook.txt', 'r') as fh:
    f = fh.read()
chapters = f.split('[BREAK]')
verse = re.compile(r'(\t*{\d+}[^{]*)', re.DOTALL)
verses = re.findall(verse, chapters[1])
Please note, it seems to work properly, but I have to check the results to make sure it accounts for everything.
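If the goal is to keep each verse number together with its text, one possible sketch is to capture the number and the body as separate groups. The sample below reuses a few lines from the question, and the names text and verses are illustrative only.

```python
import re

# A few lines from the sample document; note the jump from verse 5 to 7.
text = (
    "{5} And all shall be smitten with fear\n"
    "And the Watchers shall quake,\n"
    "\t{7} And the earth shall be [[wholly]] rent in sunder,\n"
    "And all that is upon the earth shall perish,\n"
)

# Group 1 is the number inside the braces; group 2 is everything up to the
# next "{". A negated class such as [^{] also matches newlines.
verses = [(int(number), body.strip())
          for number, body in re.findall(r"\{(\d+)\}([^{]*)", text)]

for number, body in verses:
    print(number, repr(body))
```

strip() only trims the ends of each verse, so the line breaks inside a verse are kept for verse-style passages.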
