Cleaning text using regex library not working properly

I have a text that I need to clean for further processing.
Here is the sample text:
Nigel Reuben Rook Williams (15 July 1944 – 21 April 1992) was an
English conservator and expert on the restoration of ceramics and
glass. From 1961 until his death he worked at the British Museum,
where he became the Chief Conservator of Ceramics and Glass in 1983.
There his work included the successful restorations of the Sutton Hoo
helmet and the Portland Vase.
Joining as an assistant at age 16, Williams spent his entire career,
and most of his life, at the British Museum. He was one of the first
people to study conservation, not yet recognised as a profession, and
from an early age was given responsibility over high-profile objects.
In the 1960s he assisted with the re-excavation of the Sutton Hoo
ship-burial, and in his early- to mid-twenties he conserved many of
the objects found therein: most notably the Sutton Hoo helmet, which
occupied a year of his time. He likewise reconstructed other objects
from the find, including the shield, drinking horns, and maplewood
bottles.
The "abiding passion of his life" was ceramics,[4] and the 1970s and
1980s gave Williams ample opportunities in that field. After nearly
31,000 fragments of shattered Greek vases were found in 1974 amidst
the wreck of HMS Colossus, Williams set to work piecing them together.
The process was televised, and turned him into a television
personality. A decade later, in 1988 and 1989, Williams's crowning
achievement came when he took to pieces the Portland Vase, one of the
most famous glass objects in the world, and put it back together. The
reconstruction was again televised for a BBC programme, and as with
the Sutton Hoo helmet, took nearly a year to complete.
I need to:
split the text into sentences (at the full stop symbol '.'), removing the full stop itself
split the sentences into words (Latin alphabet letters only); any other symbol should be replaced by a space, and words should be separated by single spaces
convert all text to lowercase
I'm using a Mac, and this is the code I'm running:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
fread = open('source.txt')
fwrite = open('result.txt','w+')
for line in fread:
    new_line = line
    # split the text into sentences
    new_line = re.sub(r"\." , "\r", new_line)
    # change all uppercase letters to lowercase
    new_line = new_line.lower()
    # only latin letters
    new_line = re.sub("[^a-z\s]", " ", new_line)
    # The words should be separated by single spaces.
    new_line = re.sub(r" +"," ", new_line)
    # Getting rid of space in the beginning of the sentence
    new_line = re.sub(r"ˆ\s+", "", new_line)
    fwrite.write(new_line)
fread.close()
fwrite.close()
The result was not quite as expected. The spaces at the beginning of each line were not deleted. I ran the same code on a Windows machine and noticed that sometimes the full stop was replaced by the and some other times by . So I'm not sure what is happening.
Here is a sample of the result. Since spaces were not shown on Stack Overflow, I had to show the text as code:
nigel reuben rook williams july april was an english conservator and expert on the restoration of ceramics and glass
from until his death he worked at the british museum where he became the chief conservator of ceramics and glass in
there his work included the successful restorations of the sutton hoo helmet and the portland vase
joining as an assistant at age williams spent his entire career and most of his life at the british museum
he was one of the first people to study conservation not yet recognised as a profession and from an early age was given responsibility over high profile objects
in the s he assisted with the re excavation of the sutton hoo ship burial and in his early to mid twenties he conserved many of the objects found therein most notably the sutton hoo helmet which occupied a year of his time
he likewise reconstructed other objects from the find including the shield drinking horns and maplewood bottles
the abiding passion of his life was ceramics and the s and s gave williams ample opportunities in that field
after nearly fragments of shattered greek vases were found in amidst the wreck of hms colossus williams set to work piecing them together
the process was televised and turned him into a television personality
a decade later in and williams s crowning achievement came when he took to pieces the portland vase one of the most famous glass objects in the world and put it back together
the reconstruction was again televised for a bbc programme and as with the sutton hoo helmet took nearly a year to complete
Some characters may not appear here the way I see them; for instance, before "joining" I see two ?? in TextWrangler.
By the way, using lstrip() does delete the spaces at the beginning of each sentence.
Why doesn't new_line = re.sub(r"ˆ\s+", "", new_line) work?
I suspect that the '\n' marking the end of each line is causing problems.

# split the sentences into words
new_line = re.sub("[^a-z\s]", " ", new_line)
This isn't doing what the comment says. It's actually replacing all non-letter, non-space characters with a space, which is why your output is missing numbers and punctuation.
# Getting rid of space in the beginning of the sentence
new_line = re.sub(r"ˆ\s+", "", new_line)
I don't know what character is at the front of that regex, but it's not the beginning-of-line character ^.
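To see why the pattern never matches, it can help to inspect the character itself. The following is a minimal sketch, assuming the character in the question is U+02C6 (a lookalike of the ASCII caret); the sample line is made up for illustration.

```python
import re
import unicodedata

wrong = "\u02c6"  # assumption: the lookalike character typed in the question
right = "^"       # the real start-of-string anchor, U+005E

print(unicodedata.name(wrong))  # MODIFIER LETTER CIRCUMFLEX ACCENT
print(unicodedata.name(right))  # CIRCUMFLEX ACCENT

line = "  from until his death"  # hypothetical line with leading spaces

# The lookalike is treated as a literal character, so nothing matches here.
unchanged = re.sub(wrong + r"\s+", "", line)

# The real anchor strips the leading whitespace.
stripped = re.sub(r"^\s+", "", line)

print(repr(unchanged))
print(repr(stripped))
```

With the lookalike, the regex searches for that literal character followed by whitespace, which does not occur in the text, so the line comes back unchanged.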

A few remarks:
Use a context manager for the input and output files; it closes them automatically when the block ends.
You have a wrong character in your regex, as John Gordon says.
I recommend using a regex visualization tool (e.g. https://jex.im/regulex/).
To collapse a run of unwanted characters into a single space, use the + quantifier: [^a-z]+ matches one or more consecutive non-letter characters, so replacing each match with " " leaves only single spaces.
So here is the final code snippet:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
# It's better to use context manager to read files.
# You don't have to explicitly close those files after reading.
with open('./source.txt', 'r') as source:
    text = ''
    for line in source:
        text += line.lower()  # Lower case on reading, why not.
# only latin letters & single spaces at the same time
text = re.sub("[^a-z.]+", " ", text)
# replace dots with newlines
text = re.sub(r'\.', r'\n', text)
with open('./result.txt', 'w+') as output:
    output.write(text)
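The snippet above appears to leave a space at the start of every sentence after the first, because the space that follows each full stop survives. One possible variant, sketched here on an in-memory string rather than on files, splits on the full stop first and then keeps only runs of Latin letters. The helper name clean_sentences and the sample text are assumptions for illustration, not part of the original post.

```python
import re

def clean_sentences(text):
    """Return lowercase sentences containing only a-z words joined by single spaces."""
    sentences = []
    for raw in text.lower().split("."):
        # findall keeps only runs of Latin letters, dropping digits and punctuation.
        words = re.findall(r"[a-z]+", raw)
        if words:  # skip empty pieces, e.g. after the final full stop
            sentences.append(" ".join(words))
    return sentences

sample = (
    "Nigel (15 July 1944) was an English conservator. "
    "From 1961 he worked at the\nBritish Museum."
)
for sentence in clean_sentences(sample):
    print(sentence)
```

Because the words are rejoined with " ".join, there is no leading or doubled whitespace to strip afterwards, and line breaks inside a sentence disappear as well.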

Related

Using Python to recognize to what character a line is directed towards in Shakespeare’s plays

I'm new to Python, and I’m using Python to extract lines said by certain characters in Shakespeare’s plays. I'm using a .txt file of Romeo and Juliet which essentially works as follows:
Jul. Wilt thou be gone? It is not yet near day.
It was the nightingale, and not the lark,
That pierc'd the fearful hollow of thine ear.
Nightly she sings on yond pomegranate tree.
Believe me, love, it was the nightingale.
Rom. It was the lark, the herald of the morn;
No nightingale. Look, love, what envious streaks
Do lace the severing clouds in yonder East.
Night's candles are burnt out, and jocund day
Stands tiptoe on the misty mountain tops.
I must be gone and live, or stay and die.
Jul. Yond light is not daylight; I know it, I.
It is some meteor that the sun exhales
To be to thee this night a torchbearer
And light thee on the way to Mantua.
Therefore stay yet; thou need'st not to be gone.
Rom. Let me be ta'en, let me be put to death.
I am content, so thou wilt have it so.
I'll say yon grey is not the morning's eye,
'Tis but the pale reflex of Cynthia's brow;
Nor that is not the lark whose notes do beat
The vaulty heaven so high above our heads.
I have more care to stay than will to go.
Come, death, and welcome! Juliet wills it so.
How is't, my soul? Let's talk; it is not day.
Jul. It is, it is! Hie hence, be gone, away!
It is the lark that sings so out of tune,
Straining harsh discords and unpleasing sharps.
Some say the lark makes sweet division;
This doth not so, for she divideth us.
Some say the lark and loathed toad chang'd eyes;
O, now I would they had chang'd voices too,
Since arm from arm that voice doth us affray,
Hunting thee hence with hunt's-up to the day!
O, now be gone! More light and light it grows.
Rom. More light and light- more dark and dark our woes!
The assumption I've made is that a line is directed towards the character that spoke directly before. For example, I assume that the last line of this text (' More light and light- more dark and dark our woes!') is directed towards Juliet (or Jul.).
I'm trying to extract all the lines spoken by Romeo that are directed towards Juliet, using regular expressions. This is the code I have so far:
def get_sentences(full_text):
    sentences = sent_tokenize(full_text.strip())
    return sentences
sentences = get_sentences(full_text)
lines = []
for lines in sentences:
    if re.findall("\ARom.",lines):
        print(lines)
However, this only returns a list as follows:
Rom. Rom. Rom. Rom. etc.
I've been trying to figure out what to do for hours, but I can't figure out what my next step should be.
Any help is greatly appreciated!

It looks like the pattern is that the first 'sentence' in lines is the character's name. So you could split lines on the first period and take the first piece as the name.
You could do that with split():
character = lines.split('.')[0]

You could read the whole text at once and, with multiline mode enabled via re.M, use a pattern like:
^Rom\. .*(?:\n(?!(?:Rom|Jul)\. ).*)*
Explanation
^ Start of a line (because of re.M)
Rom\. Match Rom.
.* Match the rest of the line
(?: Non-capturing group
\n Match a newline
(?!(?:Rom|Jul)\. ).* Match the whole next line only if it does not start with Rom. or Jul.
)* Close the group and repeat it zero or more times to match all following lines
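A minimal sketch of that pattern in Python, run on a shortened excerpt of the play (the excerpt and the variable names are chosen here for illustration):

```python
import re

# Shortened excerpt; each speech starts with "Jul. " or "Rom. ".
play = """Jul. Wilt thou be gone? It is not yet near day.
Believe me, love, it was the nightingale.
Rom. It was the lark, the herald of the morn;
I must be gone and live, or stay and die.
Jul. Yond light is not daylight; I know it, I.
Rom. More light and light- more dark and dark our woes!"""

# re.M makes ^ match at the start of every line, not only at the start of the text.
pattern = re.compile(r"^Rom\. .*(?:\n(?!(?:Rom|Jul)\. ).*)*", re.M)

speeches = pattern.findall(play)
for speech in speeches:
    print(speech)
    print("---")
```

Each match is one complete speech by Romeo, including its continuation lines. If more speakers appear in the full text, their abbreviations would need to be added to the (?:Rom|Jul) alternation.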

string.punctuation fails to remove certain characters from a string

My aim is to remove all punctuations from a string so that I can then get the frequency of each word in the string.
My string is:
WASHINGTON—Coming to the realization in front of millions of viewers
during the broadcast of his show, a horrified Tucker Carlson stated,
‘I…I am the mainstream media’ Wednesday as he began spiraling live on
air. “We’ve discovered evidence of rampant voter fraud, and the
president has every right to call for an investigation even if the
mainstream media thinks...” said Carlson, who trailed off, stared down
at his shaking hands, and felt a sudden ringing in his ears as he
looked back up and zeroed in on the production crew surrounding him.
“The media says…wait. Those liars on TV will try to tell you…oh God.
We’re the number-one program on cable news, aren’t we? Fox News…Fox
‘News.’ It’s the media. It’s me. This can’t be. No, no, no, no. Jesus
Christ, I make $6 million a year. Get that camera off me!” At press
time, Carlson had torn the microphone from his lapel and fled the set
in panic.
source: https://www.theonion.com/i-i-am-the-mainstream-media-realizes-horrified-tuc-1845646901
I want to remove all punctuation from it. I do that like this:
s.translate(str.maketrans('', '', string.punctuation))
This is the output:
WASHINGTON—Coming to the realization in front of millions of viewers
during the broadcast of his show a horrified Tucker Carlson stated
‘I…I am the mainstream media’ Wednesday as he began spiraling live on
air “We’ve discovered evidence of rampant voter fraud and the
president has every right to call for an investigation even if the
mainstream media thinks” said Carlson who trailed off stared down at
his shaking hands and felt a sudden ringing in his ears as he looked
back up and zeroed in on the production crew surrounding him “The
media says…wait Those liars on TV will try to tell you…oh God We’re
the numberone program on cable news aren’t we Fox News…Fox ‘News’ It’s
the media It’s me This can’t be No no no no Jesus Christ I make 6
million a year Get that camera off me” At press time Carlson had torn
the microphone from his lapel and fled the set in panic
As you can see, characters such as “, — and … still exist. Am I wrong to expect them to be removed too? If the output is correct, how can I avoid treating "‘News’" and "News" as different words?

>>> import string
>>> "“" in string.punctuation
False
>>> "—" in string.punctuation
False
Welcome to the wonderful world of Unicode where, among many other things, … is not three concatenated full stops and:
>>> import unicodedata
>>> unicodedata.name('—')
'EM DASH'
is not a hyphen.
How you want to handle the full scope of what could be considered 'punctuation' across the Unicode table is probably out of scope for this question, but you could either come up with your own ad-hoc list or use a third-party library designed for that type of text manipulation. Here is one such approach:
Best way to strip punctuation from a string
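One ad-hoc approach, sketched here as an assumption rather than as the method from the linked answer, is to drop every character whose Unicode general category starts with "P" (punctuation). The function name strip_punctuation is hypothetical.

```python
import unicodedata

def strip_punctuation(text):
    """Remove every character whose Unicode general category is a punctuation class."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

sample = "“The media says…wait. It’s me!” — Fox ‘News.’"
cleaned = strip_punctuation(sample)
print(cleaned)
```

Unlike string.punctuation, this covers curly quotes, the ellipsis character and the em dash. It is also narrower in one respect: symbols such as $ belong to a symbol category rather than a punctuation category, so they are kept.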

Here is the list of characters that your implementation removes from a string:
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
You can use this expression instead to remove all special characters while keeping spaces:
''.join(e for e in s if e.isalnum() or e == ' ')

It looks like … and a couple of the other characters you are having trouble with are non-ASCII Unicode characters, which string.punctuation does not cover. A workaround is to use str.isalpha(), which tells you whether every character of a string is alphabetic.
result = ""
for x in string:
    if x.isalpha() or x == " ":
        result = result + x
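Since the stated goal is word frequency, the same isalpha() idea can be combined with collections.Counter. This is a sketch under two assumptions: non-letters are replaced by a space instead of being deleted (so "News…Fox" becomes two words), and counting is case-insensitive. The function name word_counts is hypothetical.

```python
from collections import Counter

def word_counts(text):
    """Count lowercase words after replacing every non-letter with a space."""
    kept = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text)
    return Counter(kept.lower().split())

counts = word_counts("Fox News…Fox ‘News.’ It’s the media. It’s me.")
print(counts.most_common(3))
```

With this approach ‘News.’ and News end up as the same word. The trade-off is that contractions are split, so "It’s" is counted as "it" and "s".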

How to capture from whitespace +{n} to next {n} in regex

I've cleaned up a document to allow me to properly rip it verse by verse. Being weak in regex I cannot seem to find the right expression to extract these verses.
This is the expression I am using:
(\t?\t?{\d+}.*){
And I'm doing this in python, though I expect that does not matter.
How should I change this so that it matches each verse ({x} some verse {x} next verse), stopping just short of the next brace?
As you can see, I'm trying to keep it tabs-aware because this doc gives some attention to verse-style writing.
And here is an example doc:
{1} The words of the blessing of Enoch, wherewith he blessed the elect [[[[and]]]] righteous, who will be living in the day of tribulation, when all the wicked [[[[and godless]]]] are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, [[which]] the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, (even) on Mount Sinai,
[[And appear from His camp]]
And appear in the strength of His might from the heaven of heavens.
{5} And all shall be smitten with fear
And the Watchers shall quake,
And great fear and trembling shall seize them unto the ends of the earth.
{6} And the high mountains shall be shaken,
And the high hills shall be made low,
And shall melt like wax before the flame
{7} And the earth shall be [[wholly]] rent in sunder,
And all that is upon the earth shall perish,
And there shall be a judgement upon all (men).
{8} But with the righteous He will make peace.
And will protect the elect,
And mercy shall be upon them.
And they shall all belong to God,
And they shall be prospered,
And they shall [[all]] be blessed.
[[And He will help them all]],
And light shall appear unto them,
[[And He will make peace with them]].
{9} And behold! He cometh with ten thousands of [[His]] holy ones
To execute judgement upon all,
And to destroy [[all]] the ungodly:
And to convict all flesh
Of all the works [[of their ungodliness]] which they have ungodly committed,
And of all the hard things which ungodly sinners [[have spoken]] against Him.
[BREAK]
[CHAPTER 2]

Simply split the text on the verse markers with re.split:
import re
text = '''{1} The words of the blessing of Enoch, wherewith he blessed the elect [[[[and]]]] righteous, who will be living in the day of tribulation, when all the wicked [[[[and godless]]]] are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, [[which]] the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, (even) on Mount Sinai,
[[And appear from His camp]]
And appear in the strength of His might from the heaven of heavens.'''
result = [i for i in re.split(r'\{\d+\}', text) if i]
result has four elements, corresponding to {1} through {4} above.

(\t?\t?{\d+}.*?)(?={)
See demo.
https://regex101.com/r/OCpDb7/1
Edit:
If you want to capture the last verse as well, use
(\t?\t?{\d+}.*?)(?={|\[BREAK\])
See demo.
https://regex101.com/r/OCpDb7/2
Your original regex suffered from two problems.
(\t?\t?{\d+}.*){
             ^ ^
1) You used a greedy quantifier. Use the non-greedy .*? instead.
2) You were consuming the { of the next verse as part of the match, so that verse could no longer be matched because its opening brace had already been consumed. Use a lookahead to assert the brace without consuming it.

The answer above is good, but the verses are not always numbered consecutively in this book (i.e., it can jump from verse 5 to 7 due to manuscript details), so I had to keep the verse markers in order to pluck the numbers out later. Basically, entire verses along with their numbers had to be extracted.
The recipe seemed to be this:
verse = re.compile(r'([\t+]?{\d+}[^{]*)', re.DOTALL)
In context:
import re
f = open('thebook.txt', 'r').read()
chapters = f.split('[BREAK]')
verse = re.compile(r'([\t+]?{\d+}[^{]*)', re.DOTALL)
verses = re.findall(verse, chapters[1])
Please note, it seems to work properly, but I have to check the results to make sure it accounts for everything.
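For comparison, here is a sketch that captures the verse number and the verse text as separate groups, so gaps in the numbering are preserved. It is an alternative written for illustration, not the recipe above: the braces are escaped, re.DOTALL lets a verse span several lines, and the lookahead stops at the next verse marker, at [BREAK], or at the end of the text. The sample text is invented.

```python
import re

# Invented sample; note the jump from verse 2 to verse 4.
text = (
    "{1} First verse. {2} Second verse,\n"
    "\tcontinued on a new line.\n"
    "{4} Fourth verse.\n"
    "[BREAK]\n"
    "[CHAPTER 2]"
)

verse_re = re.compile(
    r"\{(\d+)\}\s*(.*?)\s*(?=\{\d+\}|\[BREAK\]|\Z)",
    re.DOTALL,  # let . match newlines so multi-line verses stay whole
)

verses = [(int(number), body) for number, body in verse_re.findall(text)]
for number, body in verses:
    print(number, repr(body))
```

Tabs inside a verse are kept, which matters if the verse-style layout of the document should survive.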

Using regex to split text content into dictionary

I have a text file that follows this format.
LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting
at the airport. A gunman opens fire at baggage claim in Fort
Lauderdale, witnesses describing scenes of sheer horror. A silent
killer shooting people in the head as they tried to run and hide.
Tonight, a storm of questions. Why did he do it? The suspect, a
passenger with a firearm in his checked bag. New concerns about
airport security before the checkpoint.
(00:00:25): Also breaking tonight the new report from U.S.
intelligence: Vladimir Putin himself ordered the effort to influence
the election, aimed at hurting Clinton and helping Trump win. What the
President-elect is saying after his top-secret briefing.
(00:00:39): And States of Emergency: Millions from coast to coast
paralyzed by a massive winter storm.
(00:00:45): NIGHTLY NEWS begins right now.
I am trying to parse this information into a Python dictionary in which each speaker maps to a dictionary whose keys are timecodes and whose values are the content. I can't split consistently because there may be information before the timecode (i.e. the speaker name in the first paragraph), and because the split character : also appears inside the timecode itself (00:00:00).
I tried splitting on a regex:
for line in msg.get_payload().split('\n'):
    regex = r'\d{2}:\d{2}:\d{2}'
    test = re.split(regex, line)
    print(test)
    sleep(1)
This appears to split properly, but I lose the value I am splitting on (the timecode), which I intend to use as a key. How can I split the above content properly, getting the speaker, then the timecode as a key and the content as a value? The same speaker may appear again later in the text, and those entries should be added to that speaker's timecodes.
The output format I am targeting is something along the lines of
{speakers:{'Lester Holt': {'00:00:01':content..., '00:00:25': content...},
'speaker2':{etc,etc,etc} }}
I've tried using the split as mentioned above, but it removes my timecode.
Any thoughts and guidance are appreciated.

Don't bother with split. You're trying to get 2-3 pieces of information out of each line, so try the following:
for line in msg.get_payload().split('\n'):
    match = re.search(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$', line)
    if match:
        (speaker, time, message) = match.groups()
Speaker will be an empty string if none was present on that line.
Regex explanation:
^ # Start of line
\s* # Drop leading whitespace
([^(]*?) # Capture the speaker if present (non-paren characters)
\s* # Drop whitespace between name and time
\( # Drop literal open paren
(\d{2}:\d{2}:\d{2}) # Capture time
\):\s* # Drop literal close paren, colon and whitespace
(.*) # Capture the rest of the line
$ # End of line
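The regex above can be extended into a small parser that builds the nested dictionary from the question. This is a sketch under stated assumptions: a line without a speaker name belongs to the most recent speaker, and a line without a timecode continues the previous entry. The function name parse_transcript and the shortened sample are hypothetical.

```python
import re
from collections import defaultdict

LINE_RE = re.compile(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$')

def parse_transcript(text):
    """Return {speaker: {timecode: content}} for a transcript string."""
    speakers = defaultdict(dict)
    speaker = None   # most recent speaker seen so far
    timecode = None  # most recent timecode seen so far
    for line in text.split('\n'):
        match = LINE_RE.search(line)
        if match:
            name, timecode, message = match.groups()
            if name:  # an empty name means "same speaker as before"
                speaker = name
            speakers[speaker][timecode] = message
        elif speaker is not None and timecode is not None and line.strip():
            # Continuation line: append it to the current entry.
            speakers[speaker][timecode] += ' ' + line.strip()
    return dict(speakers)

sample = (
    "LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting\n"
    "at the airport.\n"
    "(00:00:25): Also breaking tonight the new report from U.S.\n"
    "intelligence.\n"
    "(00:00:45): NIGHTLY NEWS begins right now."
)
result = parse_transcript(sample)
print(result)
```

If the same speaker appears again later in the text, the new timecodes are simply added to that speaker's existing dictionary.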

Splitting the message into lines when you need to split it into time-stamped paragraphs is wasted effort. As the documentation notes, re.split keeps the text it splits on when the pattern contains a capturing group. Here's my solution:
toks = re.split(r"\((\d\d:\d\d:\d\d)\):", msg.get_payload())[1:]
answer = dict(zip(toks[::2], toks[1::2]))
This creates a dictionary of timestamps and paragraphs. Feel free to use the same approach to split by speaker as well.
Result:
{
'00:00:01': ' Breaking News Tonight: A .....',
'00:00:25': ' Also breaking tonight ......', ....
}
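A self-contained sketch of this approach on a literal string (standing in for msg.get_payload(), which is not available here). It assumes a single speaker whose name precedes the first timecode, and it also collapses the line breaks inside each paragraph; both choices are illustrative.

```python
import re

# Stand-in for msg.get_payload(); shortened sample for illustration.
payload = (
    "LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting\n"
    "at the airport.\n"
    "(00:00:39): And States of Emergency.\n"
    "(00:00:45): NIGHTLY NEWS begins right now."
)

# The capturing group makes re.split keep the timecodes in the result list.
parts = re.split(r"\((\d\d:\d\d:\d\d)\):", payload)

speaker = parts[0].strip()  # text before the first timecode
tokens = parts[1:]          # alternating timecode, paragraph, timecode, ...

by_time = {
    stamp: " ".join(paragraph.split())  # collapse newlines and extra spaces
    for stamp, paragraph in zip(tokens[::2], tokens[1::2])
}
result = {speaker: by_time}
print(result)
```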

Removing "\n"s when printing sentences from text file in python?

I am trying to print a list of sentences from a text file (one of the Project Gutenberg eBooks). When I print the file as a single string it looks fine:
file = open('11.txt','r+')
alice = file.read()
print(alice[:500])
Output is:
ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d
Now, when I split it into sentences (The assignment was specifically to do this by "splitting at the periods," so it's a very simplified split), I get this:
>>> print(sentences[:5])
["ALICE'S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3", '0\n\n\n\n\nCHAPTER I', " Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversations?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her", "\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seemed quite natural); but when the Rabbit actually TOOK A WATCH\nOUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,\nAlice started to her feet, for it flashed across her mind that she had\nnever before seen a rabbit with either a waistcoat-pocket, or a watch\nto take out of it, and burning with curiosity, she ran across the field\nafter it, and fortunately was just in time to see it pop down a large\nrabbit-hole under the hedge", '\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again']
Where do the extra "\n" characters come from and how can I remove them?

If you want to replace all the newlines with one space, do this:
import re
new_sentences = [re.sub(r'\n+', ' ', s) for s in sentences]

You may not want to use regex, but I would do:
import re
new_sentences = []
for s in sentences:
    new_sentences.append(re.sub(r'\n{2,}', '\n', s))
This should replace all instances of two or more '\n' with a single newline, so you still have newlines, but don't have "extra" newlines.
If you want to avoid creating a new list and instead modify the existing one (credit to @gavriel and Andrew L.: I hadn't thought of using enumerate when I first posted my answer):
import re
for i, s in enumerate(sentences):
    sentences[i] = re.sub(r'\n{2,}', '\n', s)
The extra newlines aren't really extra: they are meant to be there and are visible in the text in your question. The more '\n' there are, the more space there is visible between the lines of text (i.e., one between the chapter heading and the first paragraph, and many between the edition and the chapter heading).

You'll understand where the \n characters come from with this little example:
alice = """ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d"""
print(len(alice.split(".")))
print(len(alice.split("\n")))
It all depends on how you split your text. The above example gives this output:
3
19
This means there are 3 substrings if you split the text on . and 19 substrings if you split it on \n. You can read more in the str.split documentation.
In your case you've split your text on ., so the resulting substrings still contain multiple newline characters \n. To get rid of them, you can either split these substrings again or remove them with str.replace.
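The "split again" option can be sketched as follows. Instead of str.replace, this variant splits each piece on any whitespace and rejoins it with single spaces, which avoids the doubled spaces that replacing "\n" with " " would leave where blank lines were. The sample string is a shortened stand-in for the book text.

```python
# Shortened stand-in for the contents of the file.
alice = (
    "CHAPTER I. Down the Rabbit-Hole\n\n"
    "Alice was beginning to get very tired of sitting by her sister on the\n"
    "bank, and of having nothing to do. So she was considering in her own mind\n"
)

# Split at the periods, then normalise all whitespace inside each piece.
sentences = [" ".join(part.split()) for part in alice.split(".")]

# Drop pieces that were only whitespace.
sentences = [s for s in sentences if s]

for s in sentences:
    print(repr(s))
```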

The text uses newlines as well as full stops to delimit sentences. If you simply replace the newline characters with an empty string, you will end up with words that have no space between them. Before you split alice by '.', I would use something along the lines of @elethan's solution to replace each run of multiple newlines in alice with a '.'. Then you could do alice.split('.'), and the sentences separated by multiple newlines would be split appropriately along with the sentences originally separated by a full stop.
Then your only issue is the decimal point in the version number.

file = open('11.txt','r+')
file.read().split('\n')
