How to capture from whitespace +{n} to next {n} in regex

I've cleaned up a document to allow me to properly rip it verse by verse. Being weak in regex I cannot seem to find the right expression to extract these verses.
This is the expression I am using:
(\t?\t?{\d+}.*){
And I'm doing this in python, though I expect that does not matter.
How should I change this so that it matches each verse in text like {x} some verse {x} next verse, stopping just short of the next brace?
As you can see, I'm trying to keep it tabs-aware because this doc gives some attention to verse-style writing.
And here is an example doc:
{1} The words of the blessing of Enoch, wherewith he blessed the elect [[[[and]]]] righteous, who will be living in the day of tribulation, when all the wicked [[[[and godless]]]] are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, [[which]] the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, (even) on Mount Sinai,
[[And appear from His camp]]
And appear in the strength of His might from the heaven of heavens.
{5} And all shall be smitten with fear
And the Watchers shall quake,
And great fear and trembling shall seize them unto the ends of the earth.
{6} And the high mountains shall be shaken,
And the high hills shall be made low,
And shall melt like wax before the flame
{7} And the earth shall be [[wholly]] rent in sunder,
And all that is upon the earth shall perish,
And there shall be a judgement upon all (men).
{8} But with the righteous He will make peace.
And will protect the elect,
And mercy shall be upon them.
And they shall all belong to God,
And they shall be prospered,
And they shall [[all]] be blessed.
[[And He will help them all]],
And light shall appear unto them,
[[And He will make peace with them]].
{9} And behold! He cometh with ten thousands of [[His]] holy ones
To execute judgement upon all,
And to destroy [[all]] the ungodly:
And to convict all flesh
Of all the works [[of their ungodliness]] which they have ungodly committed,
And of all the hard things which ungodly sinners [[have spoken]] against Him.
[BREAK]
[CHAPTER 2]

Simply split the text on the verse markers with re.split:
import re
text = '''{1} The words of the blessing of Enoch, wherewith he blessed the elect [[[[and]]]] righteous, who will be living in the day of tribulation, when all the wicked [[[[and godless]]]] are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, [[which]] the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, (even) on Mount Sinai,
[[And appear from His camp]]
And appear in the strength of His might from the heaven of heavens.'''
result = [i for i in re.split(r'\{\d+\}', text) if i]
result has four elements, corresponding to {1} through {4} above.

(\t?\t?{\d+}.*?)(?={)
See demo.
https://regex101.com/r/OCpDb7/1
Edit:
If you want to capture the last verse as well, use
(\t?\t?{\d+}.*?)(?={|\[BREAK\])
See demo.
https://regex101.com/r/OCpDb7/2
Your original regex suffered from two problems.
(\t?\t?{\d+}.*){
             ^ ^
1) You used a greedy quantifier. Use the non-greedy .*? instead.
2) You were consuming the { of the next verse, so that verse could no longer match because its opening brace was already part of the previous match. Use a lookahead to assert the brace without consuming it.
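In Python, the lookahead pattern above could be applied roughly as follows. This is only a sketch under a few assumptions: re.DOTALL is added so that . can cross the line breaks inside a verse (the flags used in the linked demos are not shown here), the braces are escaped for clarity, \Z is added as an extra stop so a final verse is kept even without a [BREAK] marker, and the sample string is a made-up excerpt.

```python
import re

# Made-up sample with an inline verse, a multi-line verse and a [BREAK] marker.
sample = (
    "{1} First verse. {2} Second verse,\n"
    "\tcontinued on a new line.\n"
    "{3} Third verse.\n"
    "[BREAK]\n"
)

# Lazy .*? stops at the first position where the lookahead succeeds;
# the lookahead itself consumes nothing, so the next verse can still match.
verse_re = re.compile(r"(\t?\t?\{\d+\}.*?)(?=\{|\[BREAK\]|\Z)", re.DOTALL)

verses = [v.strip() for v in verse_re.findall(sample)]
for v in verses:
    print(repr(v))
```

Stripping each match is optional; it only removes the whitespace left between one verse and the next marker.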

The answer above is good, but the verses are not always numbered consecutively in this book (i.e., it can jump from verse 5 to 7 due to manuscript details), so I had to retain the verse numbers in order to "pluck" them later. Basically, entire verses had to be extracted along with their numbers.
The recipe seemed to be this:
verse = re.compile(r'(\t*\{\d+\}[^{]*)')
In context:
import re

# read the whole book and split it into chapters
with open('thebook.txt', 'r') as fh:
    book = fh.read()
chapters = book.split('[BREAK]')
# \t* allows leading tabs; [^{]* already spans newlines, so re.DOTALL is not needed
verse = re.compile(r'(\t*\{\d+\}[^{]*)')
verses = verse.findall(chapters[1])
Please note that it seems to work properly, but I still have to check the results to make sure it accounts for everything.
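If the goal is to pluck the numbers out afterwards, one possible variation is to capture the number and the verse body as two separate groups. The sketch below is an illustration, not the poster's code: the name pair_re and the shortened chapter excerpt are assumptions, and the excerpt deliberately jumps from verse 5 to verse 7.

```python
import re

# Shortened excerpt; note the jump from verse 5 to verse 7.
chapter = (
    "{5} And all shall be smitten with fear\n"
    "And the Watchers shall quake,\n"
    "{7} And the earth shall be rent in sunder,\n"
)

# Group 1 is the verse number, group 2 is everything up to the next "{".
pair_re = re.compile(r"\{(\d+)\}([^{]*)")

verses = [(int(num), body.strip()) for num, body in pair_re.findall(chapter)]
for num, body in verses:
    print(num, repr(body))
```

A list of tuples is used rather than a dict so that a repeated verse number in a damaged manuscript would not silently overwrite an earlier entry.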

Related

Using Python to recognize to what character a line is directed towards in Shakespeare’s plays

I'm new to Python, and I’m using Python to extract lines said by certain characters in Shakespeare’s plays. I'm using a .txt file of Romeo and Juliet which essentially works as follows:
Jul. Wilt thou be gone? It is not yet near day.
It was the nightingale, and not the lark,
That pierc'd the fearful hollow of thine ear.
Nightly she sings on yond pomegranate tree.
Believe me, love, it was the nightingale.
Rom. It was the lark, the herald of the morn;
No nightingale. Look, love, what envious streaks
Do lace the severing clouds in yonder East.
Night's candles are burnt out, and jocund day
Stands tiptoe on the misty mountain tops.
I must be gone and live, or stay and die.
Jul. Yond light is not daylight; I know it, I.
It is some meteor that the sun exhales
To be to thee this night a torchbearer
And light thee on the way to Mantua.
Therefore stay yet; thou need'st not to be gone.
Rom. Let me be ta'en, let me be put to death.
I am content, so thou wilt have it so.
I'll say yon grey is not the morning's eye,
'Tis but the pale reflex of Cynthia's brow;
Nor that is not the lark whose notes do beat
The vaulty heaven so high above our heads.
I have more care to stay than will to go.
Come, death, and welcome! Juliet wills it so.
How is't, my soul? Let's talk; it is not day.
Jul. It is, it is! Hie hence, be gone, away!
It is the lark that sings so out of tune,
Straining harsh discords and unpleasing sharps.
Some say the lark makes sweet division;
This doth not so, for she divideth us.
Some say the lark and loathed toad chang'd eyes;
O, now I would they had chang'd voices too,
Since arm from arm that voice doth us affray,
Hunting thee hence with hunt's-up to the day!
O, now be gone! More light and light it grows.
Rom. More light and light- more dark and dark our woes!
The assumption I've made is that a line is directed towards the character that spoke directly before. For example, I assume that the last line of this text (' More light and light- more dark and dark our woes!') is directed towards Juliet (or Jul.).
I'm trying to extract all the lines spoken by Romeo that are directed towards Juliet, using regular expressions. This is the code I have so far:
def get_sentences(full_text):
    sentences = sent_tokenize(full_text.strip())
    return sentences

sentences = get_sentences(full_text)
lines = []
for lines in sentences:
    if re.findall("\ARom.",lines):
        print(lines)
However, this only returns a list as follows:
Rom. Rom. Rom. Rom. etc.
I've been trying to figure out what to do for hours, but I can't figure out what my next step should be.
Any help is greatly appreciated!

It looks like the first 'sentence' in lines is always the character's name. So maybe you can split lines on the first period and take the first piece as the name.
You could do that with split(), like this:
character = lines.split('.')[0]

You could read all the lines at once and, with multiline mode enabled via re.M, use a pattern like:
^Rom\. .*(?:\n(?!(?:Rom|Jul)\. ).*)*
Explanation
^ Start of a line (because of re.M)
Rom\.  Match Rom. followed by a space
.* Match the rest of the line
(?: Non-capturing group
\n Match a newline
(?!(?:Rom|Jul)\. ) Negative lookahead: the next line must not start with Rom. or Jul.
.* Match that whole line
)* Close the group and repeat it zero or more times to match all following lines
See a regex demo and a Python demo.
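A minimal sketch of how that pattern might be used from Python, assuming the play text is already in a string. The variable names and the shortened excerpt are illustrative only.

```python
import re

# Shortened excerpt; the string deliberately does not end with a newline.
play = """Jul. Wilt thou be gone? It is not yet near day.
It was the nightingale, and not the lark,
Rom. It was the lark, the herald of the morn;
No nightingale. Look, love, what envious streaks
Jul. Yond light is not daylight; I know it, I.
Rom. Let me be ta'en, let me be put to death."""

# re.MULTILINE makes ^ match at the start of every line.
pattern = re.compile(r"^Rom\. .*(?:\n(?!(?:Rom|Jul)\. ).*)*", re.MULTILINE)

# No capturing groups are used, so findall returns each whole speech.
romeo_speeches = pattern.findall(play)
for speech in romeo_speeches:
    print(speech)
    print("---")
```

Each item is a complete speech, including continuation lines, rather than the bare "Rom." that sentence tokenization produced in the question.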

string.punctuation fails to remove certain characters from a string

My aim is to remove all punctuation from a string so that I can then get the frequency of each word in the string.
My string is:
WASHINGTON—Coming to the realization in front of millions of viewers
during the broadcast of his show, a horrified Tucker Carlson stated,
‘I…I am the mainstream media’ Wednesday as he began spiraling live on
air. “We’ve discovered evidence of rampant voter fraud, and the
president has every right to call for an investigation even if the
mainstream media thinks...” said Carlson, who trailed off, stared down
at his shaking hands, and felt a sudden ringing in his ears as he
looked back up and zeroed in on the production crew surrounding him.
“The media says…wait. Those liars on TV will try to tell you…oh God.
We’re the number-one program on cable news, aren’t we? Fox News…Fox
‘News.’ It’s the media. It’s me. This can’t be. No, no, no, no. Jesus
Christ, I make $6 million a year. Get that camera off me!” At press
time, Carlson had torn the microphone from his lapel and fled the set
in panic.
source: https://www.theonion.com/i-i-am-the-mainstream-media-realizes-horrified-tuc-1845646901
I want to remove all punctuation from it. I do that like this:
s.translate(str.maketrans('', '', string.punctuation))
This is the output -
WASHINGTON—Coming to the realization in front of millions of viewers
during the broadcast of his show a horrified Tucker Carlson stated
‘I…I am the mainstream media’ Wednesday as he began spiraling live on
air “We’ve discovered evidence of rampant voter fraud and the
president has every right to call for an investigation even if the
mainstream media thinks” said Carlson who trailed off stared down at
his shaking hands and felt a sudden ringing in his ears as he looked
back up and zeroed in on the production crew surrounding him “The
media says…wait Those liars on TV will try to tell you…oh God We’re
the numberone program on cable news aren’t we Fox News…Fox ‘News’ It’s
the media It’s me This can’t be No no no no Jesus Christ I make 6
million a year Get that camera off me” At press time Carlson had torn
the microphone from his lapel and fled the set in panic
As you can see, characters such as “, — and … are still there. Am I wrong to expect them to be removed too? If the output is correct, how can I avoid treating ‘News’ and News as different words?

>>> import string
>>> "“" in string.punctuation
False
>>> "—" in string.punctuation
False
Welcome to the wonderful world of Unicode where, among many other things, … is not three concatenated full stops, and
>>> import unicodedata
>>> unicodedata.name('—')
'EM DASH'
is not a hyphen.
How you want to handle the full scope of what could be considered 'punctuation' across the Unicode table is probably out of scope for this question, but you could either come up with your own ad-hoc list or use a third-party library designed for that type of text manipulation. Here is one such approach:
Best way to strip punctuation from a string
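One ad-hoc way to cover Unicode punctuation is to test each character's general category with unicodedata.category, since all punctuation categories start with "P". The helper below is a sketch under that assumption; the function name strip_punctuation and its replacement parameter are made up for illustration.

```python
import unicodedata


def strip_punctuation(text, replacement=" "):
    """Replace every character whose Unicode category starts with 'P'."""
    cleaned = "".join(
        replacement if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # Collapse the runs of whitespace left behind by the replacements.
    return " ".join(cleaned.split())


sample = "WASHINGTON—Coming to… “the media.”"
print(strip_punctuation(sample))
```

Replacing with a space rather than deleting keeps WASHINGTON and Coming as separate words, which matters for word counts; the trade-off is that an apostrophe inside a word such as We’ve also becomes a space. Currency signs like $ are symbols (category "Sc"), not punctuation, so they are left alone by this check.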

Here is the full list of characters that your implementation removes:
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
To remove all special characters and keep spaces, you can use this instead:
''.join(e for e in s if e.isalnum() or e == ' ')

It looks like … and a couple of the other characters you are having trouble with are non-ASCII Unicode characters. A workaround is to use str.isalpha(), which tells you whether every character of a string is alphabetic. Note that this also drops digits.
result = ""
for x in s:
    if x.isalpha() or x == " ":
        result = result + x

How to make the following substitution in string using regex in Python?

I'm trying to make a substitution in the following string:
poem='''
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
'''
The requirements for the given string are as follows:
If the pattern has characters 'ai' or 'hi', replace the next three characters with *\*.
If a word has 'ch' or 'co', replace it with 'Ch' or 'Co'.
I tried the following methods:
print(re.sub(r"ai\w{3}|hi\w{3}",r"(ai|hi)*\*",poem))
Output:
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one f(ai|hi)*\*ng robin
Unto his nest again,
I shall not live in vain.
print(re.sub(r"ch|co",r"Ch|Co",poem))
Output:
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aCh|Coing,
Or Ch|Cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
You can see the output is not as per the requirements. Please help me find the correct regex expression.

You can achieve the first by referencing a captured group from the pattern in the replacement (note the raw string, so the backslashes reach re.sub unchanged):
poem = re.sub(r"(ai|hi)\w{3}", r"\g<1>*\*", poem)
For the second, you can pass a function as the replacement (see the re.sub docs):
def title(match):
    return match.group(0).title()  # or .capitalize()

poem = re.sub(r"ch|co", title, poem)

import re
poem = re.sub(r'(ai|hi)(...)', r'\1*\*', poem)
poem = re.sub('ch', 'Ch', poem)
poem = re.sub('co', 'Co', poem)
print(poem)
This outputs:
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aChi*\*
Or Cool one pain,
Or help one fai*\*ng robin
Unto hi*\*est again,
I shall not live in vain.

You can replace those step wise:
poem='''
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
'''
import re
p2 = re.sub("(?:ai|hi)...","*/*",poem)
p3 = re.sub("ch","Ch",p2)
p4 = re.sub("co","Co",p3)
print(p4)
Output:
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the ac*/*
Or Cool one pain,
Or help one f*/*ng robin
Unto */*est again,
I shall not live in vain.
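The only interesting thing is that the non-capturing group around ai|hi does not work as I expected: ai and hi are still replaced, because a non-capturing group only avoids creating a numbered group while the text it matches is still part of the overall match. You could also split the substitution into separate steps: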
p = re.sub("ai...","*/*",poem, flags = re.DOTALL)
p2 = re.sub("hi...","*/*",p, flags= re.DOTALL)
p3 = re.sub("ch","Ch",p2)
p4 = re.sub("co","Co",p3)
print(p4)
Output:
If I can stop one heart from breaking,
I shall not live in v*/*If I can ease one life the ac*/*
Or Cool one p*/*Or help one f*/*ng robin
Unto */*est ag*/*I shall not live in v*/*
The flag re.DOTALL lets . also match newline characters.
Without it, vain; would not be matched.
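If the intent was to leave ai and hi in place and replace only the three characters after them, a lookbehind is one option, since it asserts the preceding text without making it part of the match. This is a sketch rather than a correction of the answer above; Python requires lookbehinds to have a fixed width, which holds here because both alternatives are two characters long. A function is used as the replacement only to sidestep escape handling in replacement strings.

```python
import re

line = "Or help one fainting robin\nUnto his nest again,"

# (?<=ai|hi) checks that "ai" or "hi" comes right before, but does not consume it.
# The lambda returns the literal three characters *\* for every match.
masked = re.sub(r"(?<=ai|hi)...", lambda m: "*\\*", line)
print(masked)
```

Without re.DOTALL the dot does not match a newline, so "again," at the end of a line is left alone because fewer than three characters follow "ai" on that line.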

print(re.sub(r"co",r"Co",re.sub(r"ch",r"Ch",s)))
This works:
Input:
s='''It takes strength for being certain,
It takes courage to have doubt.
It takes strength for challenging alone,
It takes courage to lean on another.
It takes strength for loving other souls,
It takes courage to be loved.
It takes strength for hiding our own pain,
It takes courage to help if it is paining for someone.'''
Output:
It takes strength for being certain,
It takes Courage to have doubt.
It takes strength for Challenging alone,
It takes Courage to lean on another.
It takes strength for loving other souls,
It takes Courage to be loved.
It takes strength for hiding our own pain,
It takes Courage to help if it is paining for someone.

Here's an answer to your question:
import re
poem='''
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
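'''
p1 = poem  # keep the original text for the first requirement
print(re.sub(r"\n", "", poem))
poem = re.sub(r"co", "Co", poem)
poem = re.sub(r"ch", "Ch", poem)
print(poem)
# .{3} means "any three characters"; i{3} would only match the letter i three times
print(re.sub(r"(ai|hi).{3}", r"\1*\*", p1))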

You can use |, which acts as "or", to list alternatives, and create groups with () so that you can retain some of them in the replacement string via \<group number> (1-indexed).
For the first one, make two groups: (hi|ai), and (...) for the next three characters. Then replace only the second group and retain the first one using \1:
print(re.sub(r'(hi|ai)(...)', r'\1*\*', poem))
For the second one, make two groups, (c) and (h|o), and retain the second group using \2:
print(re.sub(r'(c)(h|o)', r'C\2', poem))

Regex parse Buffy Script using look behinds

I'm having a difficult time parsing this page: http://www.buffyworld.com/buffy/transcripts/114_tran.html
I'm attempting to get the character name with the associated dialogue.
The text looks like this:
<p>BUFFY: Wait!
<p>She stands there panting, watching the truck turn a corner.
<p>BUFFY: (whining) Don't you want your garbage?
<p>She sighs, pouts, turns and walks back toward the house.
<p>Cut to the kitchen. Buffy enters through the back door, holding a pile of
mail. She begins looking through it. We see Dawn standing by the island.
<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
<p>Dawn piles her books into her school bag. Buffy opens a letter.
<p>Close shot of the letter.
<p>
<p>Dawn smiles, and she and Willow exit. Buffy picks up the still-wrapped
sandwich and stares at it.
<p>BUFFY: (to herself) Somebody should.
<p>She sighs, puts the sandwich back in the bag.
<p>Cut to the Bronze. Pan across various people drinking and dancing,
bartender serving. Reveal Xander and Anya sitting at the bar eating chips from
several bags. A notebook sits in front of them bearing the wedding seating
chart.
<p>ANYA: See ... this seating chart makes no sense. We have to do it again.
(Xander nodding) We can't do it again. You do it.<br>XANDER: The seating
chart's fine. Let's get back to the table arrangements. I'm starting to have
dreams of gardenia bouquets. (winces) I am so glad my manly coworkers didn't
just hear me say that. (eating chips)
Ideally, I'd match from <p> or <br> to the next <p> or <br>. I was trying to use look aheads and look behinds for this:
reg = "((?<=<p>)|(?<=<br>))(?P<character>.+):(?P<dialogue>.+)((?=<p>)|(?=<br>))"
script = re.findall(reg, html_text)
Unfortunately, this doesn't match anything. When I leave off the lookahead ((?=<p>)|(?=<br>)), I match lines as long as there isn't a newline in the matching dialogue. It seems to terminate at the newline instead of continuing to the <p>
ex. On this line, the "Thanks" isn't matched. <p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
Thank you for any insight you have!

Work around the fact that the dot does not match newlines by using a negated character class instead:
re.findall('((?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)', text)
You can also try the re.DOTALL flag, which makes the dot match line breaks as well. Personally, whenever I can, I use splits or an HTML parser: regex escaping, parameters, limitations and flags can drive anyone mad. There is also re.split.
dialogs = {}
text = html_text.replace('<br>', '<p>')
paragraphs = text.split('<p>')
for p in paragraphs:
    if ":" in p:
        char, line = p.split(":", 1)
        # setdefault creates the list on first use, so the first line is kept too
        dialogs.setdefault(char, []).append(line)
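For comparison, here is a sketch of the regex route using re.DOTALL, the flag that lets the dot match line breaks. It is an illustration under stated assumptions, not the answerer's code: character names are assumed to be runs of capital letters, \Z is added so a final line without a following tag still matches, and html_text is a shortened copy of the excerpt from the question.

```python
import re

html_text = """<p>BUFFY: Wait!
<p>She stands there panting, watching the truck turn a corner.
<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
<p>Dawn piles her books into her school bag. Buffy opens a letter."""

# The tag is consumed at the start, the lazy .+? runs until the lookahead
# sees the next <p> or <br> (or the end of the text).
line_re = re.compile(
    r"(?:<p>|<br>)(?P<character>[A-Z]+):\s*(?P<dialogue>.+?)\s*(?=<p>|<br>|\Z)",
    re.DOTALL,
)

script = [
    # Collapse the hard line breaks inside a piece of dialogue into spaces.
    (m.group("character"), " ".join(m.group("dialogue").split()))
    for m in line_re.finditer(html_text)
]
for character, dialogue in script:
    print(character, "->", dialogue)
```

Stage directions are skipped because they do not consist of capital letters followed directly by a colon.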

python textwrap breaking sentences in wrong places

I'm finding python's textwrap library is breaking sentences in the wrong places. I'm using:
wrp = textwrap.TextWrapper(width=32,break_long_words=False,replace_whitespace=False)
out = '\n'.join(wrp.wrap(txt))
Applying this to the following passage*:
The Caterpillar and Alice looked at each other for some time in silence:
at last the Caterpillar took the hookah out of its mouth, and addressed
her in a languid, sleepy voice.
'Who are YOU?' said the Caterpillar.
This was not an encouraging opening for a conversation. Alice replied,
rather shyly, 'I--I hardly know, sir, just at present--at least I know
who I WAS when I got up this morning, but I think I must have been
changed several times since then.'
The result of the wrap is:
The Caterpillar and Alice looked
at each other for some time in
silence:
at last the
Caterpillar took the hookah out
of its mouth, and addressed
her
in a languid, sleepy voice.
'Who are YOU?' said the
Caterpillar.
This was not an
encouraging opening for a
conversation. Alice replied,
rather shyly, 'I--I hardly know,
sir, just at present--at least I
know
who I WAS when I got up
this morning, but I think I must
have been
changed several times
since then.
A few of the extra breaks are there because the original text is already wrapped. But incorrect breaks have still been added, e.g. at "at last the | Caterpillar", and the last sentence is a complete mess. Can anyone advise how to wrap this properly?
passage sourced with curl https://www.gutenberg.org/cache/epub/11/pg11.txt | sed -n 960,969p> alice.txt

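Preserving the text format: we replace any line break that is preceded by a letter or a comma and followed by a letter. That ensures the paragraph structure is kept. Note the raw strings: without them, "\1 \2" would be read as two control characters rather than group references.
re.sub(r"([,\w])\n(\w)", r"\1 \2", sys.stdin.read())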
The Caterpillar and Alice looked at each other for some time in silence:
at last the Caterpillar took the hookah out of its mouth, and addressed her in a languid, sleepy voice.
'Who are YOU?' said the Caterpillar.
This was not an encouraging opening for a conversation. Alice replied, rather shyly, 'I--I hardly know, sir, just at present--at least I know who I WAS when I got up this morning, but I think I must have been changed several times since then.'
You can then wrap each part:
import re
import sys
import textwrap

text = re.sub(r"([,\w])\n(\w)", r"\1 \2", sys.stdin.read())
for part in text.splitlines():
    print('\n'.join(textwrap.wrap(part, width=32)))
The Caterpillar and Alice looked
at each other for some time in
silence:
at last the Caterpillar took the
hookah out of its mouth, and
addressed her in a languid,
sleepy voice.
'Who are YOU?' said the
Caterpillar.
This was not an encouraging
opening for a conversation.
Alice replied, rather shyly, 'I
--I hardly know, sir, just at
present--at least I know who I
WAS when I got up this morning,
but I think I must have been
changed several times since
then.'
