I have a variable
sentence = "In 1794, shortly after his arrival in Manchester, Dalton was elected a member of the Manchester Literary and Philosophical Society, the "Lit & Phil", and a few weeks later he communicated his first paper on "Extraordinary facts relating to the vision of colours", in which he postulated that shortage in colour perception was caused by discoloration of the liquid medium of the eyeball. In fact, a shortage of colour perception in some people had not even been formally described or officially noticed until Dalton wrote about his own. Since both he and his brother were colour blind, he recognized that this condition must be hereditary."
The text may contain both double quotes (") and single quotes ('), either of which can terminate the string literal early. I want to prevent this. Is there another way to store the string?
Escape the embedded quotes with \:
"Some text with \"embedded\" quotes"
If your text contains only double quotes, you can use single quotes as the delimiter and avoid escaping the double quotes:
'Some text with "embedded" quotes'
Last but not least, you can triple the outer quotes and save yourself having to escape newlines too:
"""Some text with "embedded" quotes"""
"""Some text with "embedded" quotes
and a newline too"""
For your example, single quotes would already do the trick:
sentence = 'In 1794, shortly after his arrival in Manchester, Dalton was elected a member of the Manchester Literary and Philosophical Society, the "Lit & Phil", and a few weeks later he communicated his first paper on "Extraordinary facts relating to the vision of colours", in which he postulated that shortage in colour perception was caused by discoloration of the liquid medium of the eyeball. In fact, a shortage of colour perception in some people had not even been formally described or officially noticed until Dalton wrote about his own. Since both he and his brother were colour blind, he recognized that this condition must be hereditary.'
a="""triple quoted strings can contain quote like this " without ending the string"""
You can use triple quotes like:
sentence = """ long sentence with all 'kind" of symbols """
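As a quick check of the options above, here is a minimal sketch showing that the escaped, single-quoted and triple-quoted forms all produce the same string. The variable names are only illustrative.

```python
# Three ways to write the same text that contains double quotes.
escaped = "Some text with \"embedded\" quotes"
single = 'Some text with "embedded" quotes'
triple = """Some text with "embedded" quotes"""

# All three literals evaluate to the identical string.
print(escaped == single == triple)

# Triple quotes also tolerate a mix of both quote characters.
mixed = '''the "Lit & Phil", and Dalton's first paper'''
print(mixed)
```

Triple quotes are the most forgiving choice when the text mixes both kinds of quote, as long as the text does not end with the same quote character used as the delimiter.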
I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text.
It immediately shows the result, with each name on the same line as its dialogue,
which allows for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
I'm not sure whether it will work for longer dialogue.
Another solution is to extract the data from the text file that you can download by clicking the "download output file" link. That file is formatted differently: ten leading spaces indicate a speaker name and five leading spaces indicate a line of dialogue, at least for your sample screenshot.
The regex is
r" {10}(.+)(\n( {5}[^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts whatever starts with ten spaces into group 1 and whatever starts with exactly five spaces into group 2.
The subexpression " {5}[^ ].+\n" denotes a line where the first five characters are spaces, the sixth character is anything but a space, and the rest of the line is arbitrary. Since dialogue tends to span several lines, that subexpression is followed by another plus.
You will have to delete the extra whitespace from the dialogue with additional code and/or another regex.
If the number of spaces varies a bit (say 4-6 and 7-14 respectively) but the two ranges stay distinct, adjust the regex with the variable repetition operator (curly braces, e.g. {4,6}, with no space after the comma) or optional spaces ( ?):
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
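The indentation idea can be sketched in Python as follows. This is a variation on the regex above, not the exact pattern: it is anchored to line starts and uses a non-capturing group. It assumes exactly ten spaces before a name and five before each dialogue line. The sample text and the `parse` helper are hypothetical.

```python
import re

# Hypothetical sample mimicking the described layout:
# 10 leading spaces before a speaker name, 5 before each dialogue line,
# and no indentation for action lines.
sample = (
    " " * 10 + "ARTHUR\n"
    + " " * 5 + "No. I just,-- some of it's\n"
    + " " * 5 + "personal. You know?\n"
    + "He leans in.\n"
    + " " * 10 + "SOCIAL WORKER\n"
    + " " * 5 + "I understand.\n"
)

# Group 1: the name line; group 2: one or more dialogue lines.
pattern = re.compile(r"^ {10}(\S.*)\n((?: {5}\S.*\n)+)", re.MULTILINE)


def parse(text):
    """Return (name, dialogue) pairs, joining wrapped dialogue lines."""
    pairs = []
    for m in pattern.finditer(text):
        name = m.group(1).strip()
        dialogue = " ".join(line.strip() for line in m.group(2).splitlines())
        pairs.append((name, dialogue))
    return pairs


for name, dialogue in parse(sample):
    print(name, "|", dialogue)
```

Unindented action lines never match either part of the pattern, so they drop out without extra filtering.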
The last idea is to match against a pre-existing list of the names in the play, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
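A minimal sketch of the name-list idea, assuming the OCR output puts each name on the same line as its dialogue (as in the sample above). The `SPEAKERS` list and the `split_lines` helper are hypothetical and would need to be filled in for the actual script.

```python
import re

# Hypothetical list of speaker names known in advance.
SPEAKERS = ["ARTHUR", "SOCIAL WORKER"]

# Try longer names first so a short name never shadows a longer one.
alternation = "|".join(
    re.escape(name) for name in sorted(SPEAKERS, key=len, reverse=True)
)
line_re = re.compile(rf"^({alternation})\s+(.+)$")


def split_lines(lines):
    """Keep only lines that start with a known speaker name."""
    pairs = []
    for line in lines:
        m = line_re.match(line.strip())
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs


ocr_lines = [
    "ARTHUR Yeah. I mean, that's just--",
    "SOCIAL WORKER Does my reading it upset you?",
    "He leans in.",
    "She slides his journal back to him. He holds it in his lap.",
]
print(split_lines(ocr_lines))
```

Action lines are skipped because they do not begin with any listed name.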
In Python, you can use DOTALL:
import re
# mystr is assumed to hold the script text
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (this means this regex won't recognize speaker names with less than three characters. I did this to avoid false positives in special cases but you might wanna change it. Also check what kind of characters might occur in speaker name (dots, minus, lower case...))
(?=\n) is a lookahead ensuring the speaker name is directly followed by a new line (avoids a false positive if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a zero-width assertion that consumes no characters, ensuring the following part is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.
My aim is to remove all punctuation from a string so that I can then get the frequency of each word in the string.
My string is:
WASHINGTON—Coming to the realization in front of millions of viewers
during the broadcast of his show, a horrified Tucker Carlson stated,
‘I…I am the mainstream media’ Wednesday as he began spiraling live on
air. “We’ve discovered evidence of rampant voter fraud, and the
president has every right to call for an investigation even if the
mainstream media thinks...” said Carlson, who trailed off, stared down
at his shaking hands, and felt a sudden ringing in his ears as he
looked back up and zeroed in on the production crew surrounding him.
“The media says…wait. Those liars on TV will try to tell you…oh God.
We’re the number-one program on cable news, aren’t we? Fox News…Fox
‘News.’ It’s the media. It’s me. This can’t be. No, no, no, no. Jesus
Christ, I make $6 million a year. Get that camera off me!” At press
time, Carlson had torn the microphone from his lapel and fled the set
in panic.
source: https://www.theonion.com/i-i-am-the-mainstream-media-realizes-horrified-tuc-1845646901
I want to remove all punctuation from it. I do that like this:
s.translate(str.maketrans('', '', string.punctuation))
This is the output -
WASHINGTON—Coming to the realization in front of millions of viewers
during the broadcast of his show a horrified Tucker Carlson stated
‘I…I am the mainstream media’ Wednesday as he began spiraling live on
air “We’ve discovered evidence of rampant voter fraud and the
president has every right to call for an investigation even if the
mainstream media thinks” said Carlson who trailed off stared down at
his shaking hands and felt a sudden ringing in his ears as he looked
back up and zeroed in on the production crew surrounding him “The
media says…wait Those liars on TV will try to tell you…oh God We’re
the numberone program on cable news aren’t we Fox News…Fox ‘News’ It’s
the media It’s me This can’t be No no no no Jesus Christ I make 6
million a year Get that camera off me” At press time Carlson had torn
the microphone from his lapel and fled the set in panic
As you can see, characters like “, — and … still exist. Am I wrong to expect them to be removed too? If the output is correct, how can I avoid differentiating between "‘News’" and "News"?
>>> import string
>>> "“" in string.punctuation
False
>>> "—" in string.punctuation
False
Welcome to the wonderful world of Unicode where, among many other things, … is not three concatenated full stops and:
>>> import unicodedata
>>> unicodedata.name('—')
'EM DASH'
is not a hyphen.
How you want to handle the full scope of what could be considered 'punctuation' across the Unicode table is probably out of scope for this question, but you could either come up with your own ad-hoc list or use a third-party library designed for that type of text manipulation. Here is one such approach:
Best way to strip punctuation from a string
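One ad-hoc approach, sketched here under stated assumptions, is to treat every character whose Unicode general category starts with "P" as punctuation. The helper names `strip_punctuation` and `word_counts` are hypothetical. Dropping apostrophes instead of replacing them with a space is a design choice made here so that "aren’t" stays one word.

```python
import unicodedata
from collections import Counter

# Apostrophes are dropped rather than replaced, so "It’s" becomes "Its".
APOSTROPHES = {"'", "\u2019"}


def strip_punctuation(text):
    """Replace Unicode punctuation (categories P*) with spaces."""
    out = []
    for ch in text:
        if ch in APOSTROPHES:
            continue
        if unicodedata.category(ch).startswith("P"):
            out.append(" ")  # a space keeps "fraud—and" from merging
        else:
            out.append(ch)
    return "".join(out)


def word_counts(text):
    """Count lower-cased words after punctuation has been removed."""
    return Counter(strip_punctuation(text).lower().split())


counts = word_counts("Fox News\u2026Fox \u2018News.\u2019 It\u2019s the media.")
print(counts.most_common())
```

Note that currency signs such as $ belong to a symbol category, not a punctuation category, so this sketch leaves them in place.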
I added the list of characters you can remove from string by using your implementation.
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
You can check this implementation to remove all special characters and keep whitespaces
''.join(e for e in s if e.isalnum() or e == ' ')
It looks like the … and a couple of the other characters you are having trouble with are special Unicode characters. A workaround is to use str.isalpha(), which tells you whether every character in a string is alphabetic. Applied character by character, it keeps only letters and spaces:
result = ""
for x in s:
    if x.isalpha() or x == " ":
        result = result + x
I have a text that I need to clean for further processing.
Here is the sample text:
Nigel Reuben Rook Williams (15 July 1944 – 21 April 1992) was an
English conservator and expert on the restoration of ceramics and
glass. From 1961 until his death he worked at the British Museum,
where he became the Chief Conservator of Ceramics and Glass in 1983.
There his work included the successful restorations of the Sutton Hoo
helmet and the Portland Vase.
Joining as an assistant at age 16, Williams spent his entire career,
and most of his life, at the British Museum. He was one of the first
people to study conservation, not yet recognised as a profession, and
from an early age was given responsibility over high-profile objects.
In the 1960s he assisted with the re-excavation of the Sutton Hoo
ship-burial, and in his early- to mid-twenties he conserved many of
the objects found therein: most notably the Sutton Hoo helmet, which
occupied a year of his time. He likewise reconstructed other objects
from the find, including the shield, drinking horns, and maplewood
bottles.
The "abiding passion of his life" was ceramics,[4] and the 1970s and
1980s gave Williams ample opportunities in that field. After nearly
31,000 fragments of shattered Greek vases were found in 1974 amidst
the wreck of HMS Colossus, Williams set to work piecing them together.
The process was televised, and turned him into a television
personality. A decade later, in 1988 and 1989, Williams's crowning
achievement came when he took to pieces the Portland Vase, one of the
most famous glass objects in the world, and put it back together. The
reconstruction was again televised for a BBC programme, and as with
the Sutton Hoo helmet, took nearly a year to complete.
I need to:
split the text into sentences (by the full stop symbol '.'), eliminating the full stop symbol
split the sentences into words (only latin alphabet letters), other symbols should be replaced by the space character and only single spaces should be used to separate those words
Show all text in lowercase
I'm using a Mac and I get this code running:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
fread = open('source.txt')
fwrite = open('result.txt','w+')
for line in fread:
    new_line = line
    # split the text into sentences
    new_line = re.sub(r"\." , "\r", new_line)
    # change all uppercase letters to lowercase
    new_line = new_line.lower()
    # only latin letters
    new_line = re.sub("[^a-z\s]", " ", new_line)
    # The words should be separated by single spaces.
    new_line = re.sub(r" +"," ", new_line)
    # Getting rid of space in the beginning of the sentence
    new_line = re.sub(r"ˆ\s+", "", new_line)
    fwrite.write(new_line)
fread.close()
fwrite.close()
The result was not quite as expected. The spaces at the beginning of each line were not deleted. I ran the same code on a Windows machine and noticed that sometimes the full stop was replaced by the and some other times by . So I'm not sure what is happening.
Here is a sample of the result. Since spaces were not shown in stackoverflow, I had to show the text as code:
nigel reuben rook williams july april was an english conservator and expert on the restoration of ceramics and glass
from until his death he worked at the british museum where he became the chief conservator of ceramics and glass in
there his work included the successful restorations of the sutton hoo helmet and the portland vase
joining as an assistant at age williams spent his entire career and most of his life at the british museum
he was one of the first people to study conservation not yet recognised as a profession and from an early age was given responsibility over high profile objects
in the s he assisted with the re excavation of the sutton hoo ship burial and in his early to mid twenties he conserved many of the objects found therein most notably the sutton hoo helmet which occupied a year of his time
he likewise reconstructed other objects from the find including the shield drinking horns and maplewood bottles
the abiding passion of his life was ceramics and the s and s gave williams ample opportunities in that field
after nearly fragments of shattered greek vases were found in amidst the wreck of hms colossus williams set to work piecing them together
the process was televised and turned him into a television personality
a decade later in and williams s crowning achievement came when he took to pieces the portland vase one of the most famous glass objects in the world and put it back together
the reconstruction was again televised for a bbc programme and as with the sutton hoo helmet took nearly a year to complete
Different characters may not appear as I see, for instance, before joining I see two ?? using TextWrangler.
Using the lstrip() function works to delete the spaces in the beginning of each sentence, by the way.
Why doesn't new_line = re.sub(r"ˆ\s+", "", new_line) work?
I suspect that the '\n' used to mark the end of the line is generating some problems.
# split the sentences into words
new_line = re.sub("[^a-z\s]", " ", new_line)
This isn't doing what the comment says. It's actually replacing all non-letter, non-space characters with a space, which is why your output is missing numbers and punctuation.
# Getting rid of space in the beginning of the sentence
new_line = re.sub(r"ˆ\s+", "", new_line)
I don't know what character is at the front of that regex, but it's not the beginning-of-line character ^.
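To see the difference, here is a small sketch. It assumes the stray character is U+02C6, a look-alike that can be checked with unicodedata; verify this against your own source file. The regex with the look-alike searches for that literal character and therefore never matches.

```python
import re
import unicodedata

# Assumption: the character pasted into the regex is U+02C6.
lookalike = "\u02c6"
caret = "^"

print(unicodedata.name(lookalike))  # MODIFIER LETTER CIRCUMFLEX ACCENT
print(unicodedata.name(caret))      # CIRCUMFLEX ACCENT

line = "  from until his death"

# The look-alike is treated as a literal character, so nothing is replaced.
unchanged = re.sub(lookalike + r"\s+", "", line)

# The real caret anchors at the start of the string and strips the spaces.
stripped = re.sub(r"^\s+", "", line)

print(repr(unchanged))
print(repr(stripped))
```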
A few remarks:
Use a context manager for the input and output files; it closes them automatically once the block exits.
You have a wrong character in the regex, as John Gordon says.
I recommend using a regex visualization tool (e.g. https://jex.im/regulex/).
To replace every run of unwanted characters with a single space, add the plus operator: [^a-z]+ matches one or more consecutive non-letter characters.
So the final code snippet I've made
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# It's better to use a context manager to read files.
# You don't have to explicitly close those files after reading.
with open('./source.txt', 'r') as source:
    text = ''
    for line in source:
        text += line.lower()  # Lower case on reading, why not.

# only latin letters & single spaces at the same time
text = re.sub("[^a-z.]+", " ", text)
# replace dots with newlines
text = re.sub(r'\.', r'\n', text)

with open('./result.txt', 'w+') as output:
    output.write(text)
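If each sentence should also lose its leading space, as the question asked, one option is to split on the full stop first and clean each piece separately. This is a sketch, not the code from the answers above, and `clean_sentences` is a hypothetical helper name.

```python
import re


def clean_sentences(text):
    """Split on '.', keep only a-z words, single-spaced, no edge spaces."""
    sentences = []
    for raw in text.lower().split("."):
        # Every run of non-letters (digits, newlines, punctuation)
        # collapses to one space; strip() removes leading/trailing ones.
        words = re.sub(r"[^a-z]+", " ", raw).strip()
        if words:
            sentences.append(words)
    return sentences


sample = (
    "Nigel Williams (15 July 1944 \u2013 21 April 1992) was an\n"
    "English conservator. From 1961 he worked."
)
for sentence in clean_sentences(sample):
    print(sentence)
```

Because the newline characters are folded into the non-letter runs, sentences that wrap across input lines come out on a single line.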
I have a text file that follows this format.
LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting
at the airport. A gunman opens fire at baggage claim in Fort
Lauderdale, witnesses describing scenes of sheer horror. A silent
killer shooting people in the head as they tried to run and hide.
Tonight, a storm of questions. Why did he do it? The suspect, a
passenger with a firearm in his checked bag. New concerns about
airport security before the checkpoint.
(00:00:25): Also breaking tonight the new report from U.S.
intelligence: Vladimir Putin himself ordered the effort to influence
the election, aimed at hurting Clinton and helping Trump win. What the
President-elect is saying after his top-secret briefing.
(00:00:39): And States of Emergency: Millions from coast to coast
paralyzed by a massive winter storm.
(00:00:45): NIGHTLY NEWS begins right now.
I am trying to parse this into a Python dictionary keyed by speaker, where each value is itself a dictionary mapping timecodes to content. I can't split consistently, because there may be text before the timecode (i.e. the speaker name on the first line), and because the split character : also appears inside the timecode itself (00:00:00).
Trying to split according to the regex.
for line in msg.get_payload().split('\n'):
    regex = r'\d{2}:\d{2}:\d{2}'
    test = re.split(regex, line)
    print(test)
    sleep(1)
This appears to split it properly, but I lose the value I am splitting on (the timecode), which I intend to use as a key. How can I properly split the above content, get the speaker, and then use the timecode as a key and the content as a value? The speaker may appear again later in the text, in which case the new entries should be added to that speaker's existing timecodes.
The output format I am targeting is something along the lines of
{speakers:{'Lester Holt': {'00:00:01':content..., '00:00:25': content...},
'speaker2':{etc,etc,etc} }}
I've tried using the split as mentioned above, but it removes my timecode variable.
Any thoughts and guidance are appreciated.
Don't bother with split. You're trying to get 2-3 pieces of information out of each line, so try the following:
for line in msg.get_payload().split('\n'):
    match = re.search(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$', line)
    if match:
        (speaker, time, message) = match.groups()
Speaker will be an empty string if none was present on that line.
Regex explanation:
^ # Start of line
\s* # Drop leading whitespace
([^(]*?) # Capture the speaker if present (non-paren characters)
\s* # Drop whitespace between name and time
\( # Drop literal open paren
(\d{2}:\d{2}:\d{2}) # Capture time
\):\s* # Drop literal close paren, colon and whitespace
(.*) # Capture the rest of the line
$ # End of line
Splitting the message into lines is wasteful when what you need are time-stamped paragraphs. As the documentation notes, re.split keeps the tokens it splits on when the pattern contains a capturing group. Here's my solution:
toks = re.split(r"\((\d\d:\d\d:\d\d)\):", msg.get_payload())[1:]
answer = dict(zip(toks[::2], toks[1::2]))
This creates a dictionary of timestamps and paragraphs. Feel free to use the same approach to split by speaker as well.
Result:
{
'00:00:01': ' Breaking News Tonight: A .....',
'00:00:25': ' Also breaking tonight ......', ....
}
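Combining the two ideas, the nested speaker-to-timecode dictionary from the question could be built roughly as below. This is a sketch under assumptions: a timestamp without a name belongs to the most recently named speaker, and the transcript is available as one string. `STAMP` and `parse_transcript` are hypothetical names.

```python
import re
from collections import defaultdict

# Optional speaker name at the start of a line, then "(hh:mm:ss):".
STAMP = re.compile(
    r"^[ \t]*([^()\n]*?)[ \t]*\((\d{2}:\d{2}:\d{2})\):", re.MULTILINE
)


def parse_transcript(text):
    """Return {speaker: {timecode: content}} for the whole transcript."""
    speakers = defaultdict(dict)
    current = None  # stays None if the first stamp has no speaker name
    matches = list(STAMP.finditer(text))
    for i, m in enumerate(matches):
        if m.group(1):
            current = m.group(1)
        # Content runs from this stamp to the next stamp (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        content = " ".join(text[m.end():end].split())
        speakers[current][m.group(2)] = content
    return dict(speakers)


sample = (
    "LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting\n"
    "at the airport.\n"
    "(00:00:25): Also breaking tonight.\n"
    "(00:00:45): NIGHTLY NEWS begins right now.\n"
)
print(parse_transcript(sample))
```

A speaker who reappears later simply gets more timecode keys added to the same inner dictionary.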
Let's say we have text within which some quotes are stored in the form:
user:quote
we can have multiple quotes within a text.
Agatha Drake: She records her videos from the future? What is she, a
f**ing time lord? Is she Michael J. Fox?
Harvey Spencer: This is just like that one movie where that one guy
changed one tiny, little thing in his childhood to stop the girl of
his dreams from being a crackhead in the future!
How can I extract the quotes (She records her videos from ..., This is just like that one movie....) from the text in Python?
I tried
re.findall('\S\:\s?(.*)', text)
But it's not doing the job.
https://regex101.com/r/vH63Go/1
How can I do it in Python?
If your string is following the consistent format of user at the start of a line and double newlines ending a quote, you could use this:
(?m)^[^:\n]+:\s?((?:.+\n?)*)
It uses multiline mode and matches the start of a line, followed by characters that are neither : nor a newline, followed by :. It then captures all following lines that have content.
Here's a demo on regex101.
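In Python, the pattern can be used with re.findall, which returns the contents of the single capturing group for each match. The sample text below is a shortened, hypothetical version of the one in the question, and joining the wrapped lines with spaces is an optional extra step.

```python
import re

text = (
    "Agatha Drake: She records her videos from the future? What is she, a\n"
    "time lord? Is she Michael J. Fox?\n"
    "\n"
    "Harvey Spencer: This is just like that one movie where that one guy\n"
    "changed one tiny, little thing in his childhood.\n"
)

# Same pattern as above, with the (?m) flag passed as re.MULTILINE.
pattern = re.compile(r"^[^:\n]+:\s?((?:.+\n?)*)", re.MULTILINE)

# Collapse the line breaks inside each captured quote.
quotes = [" ".join(q.split()) for q in pattern.findall(text)]
for quote in quotes:
    print(quote)
```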