Python: Word-by-word text processing between two files

I'm new to NLP. I have two text files. The first file has dialogues formatted like below.
RECEPTIONIST Can I help you?
LINCOLN Yes. Um, is this the State bank?
RECEPTIONIST If you have to ask, maybe you shouldn't be here.
SARAH I think this is the place.
RECEPTIONIST Fill in the query.
LINCOLN Thank-you. We'll be right back.
RECEPTIONIST Oh, take your time. I'll just finish my crossword puzzle.
oh, wait.
The second text file has 7 columns. The 5th column contains the sequence of words from the dialogues, like below.
Column 5
Can
I
help
you
?
yes
.
Um
,
Full stops and commas are considered words here, and 3 or more full stops together, like "...", should be treated as a single word. Hyphenated words like "Thank-you" (no space in between) should also be treated as a single word.
Now I want to write a Python script to compare each word from the dialogues and then add a new column (Column 8) that shows who speaks the word, like below:
Column 5 Column 8
Can RECEPTIONIST
I RECEPTIONIST
help RECEPTIONIST
you RECEPTIONIST
? RECEPTIONIST
yes LINCOLN
. LINCOLN
Um LINCOLN
, LINCOLN
As I'm completely new to the Python environment, I don't know where to start. Please provide your suggestions and any coding tips! To recap: the first file has the dialogues and the second file has information about the dialogues.

I suggest performing the following steps:
Process text file 1
Here you want to split a string like LEONARD Agreed, what's your point into a set of tokens. A naive approach is to use split(" "), which splits the text on spaces; however, you also need to take punctuation into consideration.
I suggest using NLTK, a Python library for natural language processing. A basic example shows how it might help you:
import nltk

sentence = "Hi this is a test."
tokens = nltk.word_tokenize(sentence)
print(tokens)
# output: ['Hi', 'this', 'is', 'a', 'test', '.']
Once you have tokenised each sentence correctly, you will know how many lines it takes up in the second text file.
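A rough sketch of this step, assuming (as in the sample) that the speaker name is a single ALL-CAPS first word and that lines without one continue the previous speaker; the filename is a placeholder:

import nltk  # may require a one-time nltk.download('punkt')

pairs = []  # (speaker, token) in dialogue order
speaker = None
with open("dialogues.txt") as f:  # hypothetical filename for the first file
    for line in f:
        words = line.split()
        if not words:
            continue
        # A single ALL-CAPS first word is taken to be the speaker; lines
        # without one (e.g. "oh, wait.") keep the previous speaker.
        if words[0].isupper() and len(words[0]) > 1:
            speaker = words[0]
            words = words[1:]
        for token in nltk.word_tokenize(" ".join(words)):
            pairs.append((speaker, token))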
Process text file 2
Now iterate over each line in the second text file and check whether the word matches the corresponding token you found in the first step. If it does, append the first token (the name of the person who said it) to the end of the line (column 8).
You can get the word from the string TheBigBangTheory.Season01.Episode01.en 1 59.160 0.070 you 0.990 lex by simply doing sentence.split(" ")[4], which returns you in this case.
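A minimal sketch of that loop, reusing the (speaker, token) pairs from the sketch above (the filenames and the space-separated column layout are assumptions):

out_lines = []
with open("words.txt") as f:  # hypothetical filename for the second file
    for k, line in enumerate(f):
        word = line.split(" ")[4]   # column 5 holds the word
        speaker, token = pairs[k]   # pairs built while processing file 1
        if word != token:           # sanity check that the two files stay aligned
            print("mismatch at line", k, ":", word, "vs", token)
        out_lines.append(line.rstrip("\n") + " " + speaker + "\n")

with open("words_with_speaker.txt", "w") as f:  # hypothetical output file
    f.writelines(out_lines)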
I believe it will still need some tweaking, but I'll leave that to you. This might outline the general idea.
Good luck, Bazinga!

Related

Regex for multiple lines separated with "return" and multiple unnecessary spaces

I am trying to parse a movie script into a dataset containing two columns, 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem; parsing the script is the problem.
The schema of the script, copied and pasted into a .txt file, looks something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue, and not the action descriptions like the last line: "slides his journal back". The dialogues often exceed two lines, so please do not provide solutions hard-coded for 2 lines only. I may be thinking about this problem in just one direction; some other filtering method could also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, there are no specific endpoints to end the search at. It would be a great help if the answer has 2 groups, one with the name, the other with the dialogue. In the colon-separated case, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text. It shows the outcome immediately, with names on the same line as the dialogs, which allows for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
Not sure whether it will work for longer dialogs.
Another solution is to extract the data from the text file that you can download by clicking the "download output file" link. That file is formatted differently: in it, 10 leading spaces indicate the name and 5 leading spaces the dialog, at least for your sample screenshot.
The regex is
r" (.+)(\n( [^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts into group 1 whatever starts with ten spaces, and whatever starts with exactly five spaces into the second: the subexpression " {5}[^ ].+\n" denotes a line whose first five symbols are spaces, whose sixth symbol is anything but a space, and whose remaining symbols until the end of the line are arbitrary. Since dialogs tend to be multiline, that expression is followed by a plus.
You will have to delete the extra whitespace from the dialogue with additional code and/or regex, as in the sketch below.
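A minimal sketch of that approach (the sample string is invented to match the layout described above):

import re

# Hypothetical sample: names indented with 10 spaces, dialog lines with exactly 5.
script = (
    "          ARTHUR\n"
    "     Yeah. I mean, that's just--\n"
    "          SOCIAL WORKER\n"
    "     Does my reading it upset you?\n"
    "     I just want to make\n"
    "     sure you're keeping up with it.\n"
)

pattern = r" {10}(.+)(\n( {5}[^ ].+\n)+)"
for name, dialog, _ in re.findall(pattern, script):
    # Collapse the indentation and newlines inside the dialog block.
    dialog = " ".join(line.strip() for line in dialog.splitlines() if line.strip())
    print(name.strip(), "->", dialog)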
If the amount of spaces varies a bit (say 7-14 for the names and 4-6 for the dialog) but the sections remain distinct, the regex needs to be adjusted using the variable repetition operator (curly braces, e.g. {4,6}) or an optional space ( ?):
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
The last idea is to use a preexisting list of the names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
In Python, you can use DOTALL:
import re

re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))  # mystr holds the script text
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (This means the regex won't recognize speaker names with fewer than three characters; I did this to avoid false positives in special cases, but you might want to change it. Also check what other characters might occur in speaker names: dots, hyphens, lower case...)
(?=\n) is a lookahead ensuring the speaker name is directly followed by a newline (this avoids false positives when a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a non-capturing expression, ensuring that what follows is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.

Python text to sentences when uppercase word appears

I am using Google Speech-to-Text API and after I transcribe an audio file, I end up with a text which is a conversation between two people and it doesn't contain punctuation (Google's automatic punctuation or speaker diarization features are not supported for this non-English language). For example:
Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course
It appears as one big sentence, but I want to split the different sentences whenever an uppercase word appears, and thus have:
Hi you are speaking with customer support how can i help you
Hi my name is whatever and this is my problem
Can you give me your address please
Yes of course
I am using Python and I don't want to use regex; instead I want a simpler method. What should I add to this code to split each result into multiple sentences whenever I see an uppercase letter?
# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
for i, result in enumerate(response.results):
    transcribed_text = []
    # The first alternative is the most likely one for this portion.
    alternative = result.alternatives[0]
    print("-" * 20)
    print("First alternative of result {}".format(i))
    print("Transcript: {}".format(alternative.transcript))
A simple solution would be a regex split:
import re

inp = "Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course"
sentences = re.split(r'\s+(?=[A-Z])', inp)
print(sentences)
This prints:
['Hi you are speaking with customer support how can i help you',
'Hi my name is whatever and this is my problem',
'Can you give me your address please',
'Yes of course']
Note that this simple approach can easily fail if there are things like proper names in the middle of sentences, or acronyms, both of which also contain uppercase letters but do not mark an actual sentence boundary. A better long-term approach would be to use a library like nltk, which can find sentences with much higher accuracy.
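Since the question asks for a non-regex alternative, here is a minimal sketch using only string methods (the same caveats about mid-sentence capitals apply):

inp = "Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course"

sentences = []
current = []
for word in inp.split():
    # Start a new sentence whenever a word begins with an uppercase letter.
    if word[0].isupper() and current:
        sentences.append(" ".join(current))
        current = []
    current.append(word)
if current:
    sentences.append(" ".join(current))

print(sentences)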

Using regex to split text content into dictionary

I have a text file that follows this format.
LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting
at the airport. A gunman opens fire at baggage claim in Fort
Lauderdale, witnesses describing scenes of sheer horror. A silent
killer shooting people in the head as they tried to run and hide.
Tonight, a storm of questions. Why did he do it? The suspect, a
passenger with a firearm in his checked bag. New concerns about
airport security before the checkpoint.
(00:00:25): Also breaking tonight the new report from U.S.
intelligence: Vladimir Putin himself ordered the effort to influence
the election, aimed at hurting Clinton and helping Trump win. What the
President-elect is saying after his top-secret briefing.
(00:00:39): And States of Emergency: Millions from coast to coast
paralyzed by a massive winter storm.
(00:00:45): NIGHTLY NEWS begins right now.
I am trying to parse this information into a Python dictionary of dictionaries, where each speaker maps to a dictionary with timecode keys and the content as values. I can't split consistently because of the potential information before the timecode (i.e. the speaker in the first quote), and because the split character : also appears inside the timecode itself (00:00:00).
Trying to split with the following regex:
for line in msg.get_payload().split('\n'):
    regex = r'\d{2}:\d{2}:\d{2}'
    test = re.split(regex, line)
    print(test)
    sleep(1)
This appears to split properly, but it loses the value I am splitting on (the timecode), which I intend to use as a key. How can I properly split the above content, get the speaker, and then get the timecode as a key and the content as a value? A speaker may appear again later in the text as well, and that should append to his dictionary of timecodes.
The output format I am targeting is something along the lines of
{'speakers': {'Lester Holt': {'00:00:01': content..., '00:00:25': content...},
              'speaker2': {etc, etc, etc}}}
I've tried using the split as mentioned above, but it removes my timecode value.
Any thoughts and guidance are appreciated.
Don't bother with split. You're trying to get 2-3 pieces of information out of each line, so try the following:
for line in msg.get_payload().split('\n'):
    match = re.search(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$', line)
    if match:
        (speaker, time, message) = match.groups()
Speaker will be an empty string if none was present on that line.
Regex explanation:
^ # Start of line
\s* # Drop leading whitespace
([^(]*?) # Capture the speaker if present (non-paren characters)
\s* # Drop whitespace between name and time
\( # Drop literal open paren
(\d{2}:\d{2}:\d{2}) # Capture time
\):\s* # Drop literal close paren, colon and whitespace
(.*) # Capture the rest of the line
$ # End of line
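To get from those groups to the nested dictionary shown in the question, a rough sketch (it carries the last seen speaker forward for lines without one; note that continuation lines without a timecode are skipped here):

import re

speakers = {}
current_speaker = None
for line in msg.get_payload().split('\n'):
    match = re.search(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$', line)
    if not match:
        continue  # skip continuation lines and blank lines
    speaker, time, message = match.groups()
    if speaker:  # an empty string means "same speaker as before"
        current_speaker = speaker
    speakers.setdefault(current_speaker, {})[time] = message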
Splitting the message into lines when you need to split it into time-stamped paragraphs is a waste. re.split can easily save the tokens it splits on, if you look at the documentation. Here's my solution:
toks = re.split(r"\((\d\d:\d\d:\d\d)\):", msg.get_payload())[1:]
answer = dict(zip(toks[::2], toks[1::2]))
This creates a dictionary of timestamps and paragraphs. Feel free to use the same approach to split by speaker as well.
Result:
{
'00:00:01': ' Breaking News Tonight: A .....',
'00:00:25': ' Also breaking tonight ......', ....
}
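Extending the same approach to also split by speaker, a hedged sketch (it assumes a speaker's ALL-CAPS name appears before that speaker's first timecode, as in the sample):

import re

toks = re.split(r"\((\d\d:\d\d:\d\d)\):", msg.get_payload())

speakers = {}
current = toks[0].strip()  # the text before the first timecode is the first speaker
speakers[current] = {}
for time, text in zip(toks[1::2], toks[2::2]):
    # A new ALL-CAPS speaker name, if any, trails the previous paragraph.
    m = re.search(r"\n([A-Z][A-Z. ]+)\s*$", text)
    if m:
        speakers[current][time] = text[:m.start()].strip()
        current = m.group(1).strip()
        speakers.setdefault(current, {})
    else:
        speakers[current][time] = text.strip()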

Removing "\n"s when printing sentences from text file in python?

I am trying to print a list of sentences from a text file (one of the Project Gutenberg eBooks). When I print the file as a single string it looks fine:
file = open('11.txt','r+')
alice = file.read()
print(alice[:500])
Output is:
ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d
Now, when I split it into sentences (the assignment was specifically to do this by "splitting at the periods," so it's a very simplified split), I get this:
>>> print(sentences[:5])
["ALICE'S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3", '0\n\n\n\n\nCHAPTER I', " Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversations?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her", "\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seemed quite natural); but when the Rabbit actually TOOK A WATCH\nOUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,\nAlice started to her feet, for it flashed across her mind that she had\nnever before seen a rabbit with either a waistcoat-pocket, or a watch\nto take out of it, and burning with curiosity, she ran across the field\nafter it, and fortunately was just in time to see it pop down a large\nrabbit-hole under the hedge", '\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again']
Where do the extra "\n" characters come from and how can I remove them?
If you want to replace all the newlines with one space, do this:
import re
new_sentences = [re.sub(r'\n+', ' ', s) for s in sentences]
You may not want to use regex, but I would do:
import re

new_sentences = []
for s in sentences:
    new_sentences.append(re.sub(r'\n{2,}', '\n', s))
This should replace all instances of two or more '\n' with a single newline, so you still have newlines, but don't have "extra" newlines.
If you want to avoid creating a new list and instead modify the existing one (credit to @gavriel and Andrew L.: I hadn't thought of using enumerate when I first posted my answer):
import re

for i, s in enumerate(sentences):
    sentences[i] = re.sub(r'\n{2,}', '\n', s)
The extra newlines aren't really extra: they are meant to be there and are visible in the text in your question. The more '\n' there are, the more visible space there is between the lines of text (i.e., one between the chapter heading and the first paragraph, and many between the edition and the chapter heading).
You'll understand where the \n characters come from with this little example:
alice = """ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d"""
print len(alice.split("."))
print len(alice.split("\n"))
It all depends on the way you're splitting your text; the above example will give this output:
3
19
This means there are 3 substrings if you split the text using . as the separator, or 19 substrings if you split using \n. You can read more in the str.split documentation.
In your case you've split your text using ., so the 3 substrings contain multiple newline characters \n; to get rid of them you can either split these substrings again or remove them using str.replace.
The text uses newlines as well as full stops to delimit sentences, and simply replacing the newline characters with an empty string would leave words without spaces between them. Before you split alice by '.', I would use something along the lines of @elethan's solution to replace every run of multiple newlines in alice with a '.'. Then you could do alice.split('.'), and the sentences separated by multiple newlines would be split appropriately, along with the sentences separated by . initially.
Then your only issue is the decimal point in the version number; see the sketch below.
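A minimal sketch of that combined approach (the version-number caveat still applies):

import re

with open('11.txt') as f:
    alice = f.read()

# Turn runs of 2+ newlines into sentence breaks, then collapse the rest.
normalized = re.sub(r'\n{2,}', '. ', alice)
normalized = normalized.replace('\n', ' ')

# Note: this still splits on the '.' inside 'EDITION 3.0'.
sentences = [s.strip() for s in normalized.split('.') if s.strip()]
print(sentences[:5])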
file = open('11.txt','r+')
file.read().split('\n')

Splitting individual sentence to list

I am asking how to make individual lists, not how to find a substring, which this question was marked a duplicate of.
I have the following file
'Gentlemen do not read each others mail.' Henry Stinson
'The more corrupt the state, the more numerous the laws.' Tacitus
'The price of freedom is eternal vigilance.' Thomas Jefferson
'Few false ideas have more firmly gripped the minds of so many intelligent men than the one that, if they just tried, they could invent a cipher that no one could break.' David Kahn
'Who will watch the watchmen.' Juvenal
'Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.' John Von Neumann
'They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.' Benjamin Franklin
'And so it often happens that an apparently ingenious idea is in fact a weakness which the scientific cryptographer seizes on for his solution.' Herbert Yardley
I am trying to convert each sentence to a list so that when I search for a word, say "Gentlemen", it prints the entire sentence.
I am able to get the lines to split, but I am unable to convert them to individual lists. I have tried a few things from the internet but nothing has helped so far.
Here is what I have so far:
def myFun(filename):
    file = open(filename, "r")
    c1 = [line for line in file]
    for i in c1:
        print(i)
You can use in to search a string or a list, for example 7 in a_list or "I" in "where am I".
You can iterate directly over a file if you want:
for line in open("my_file.txt"):
although to make sure it gets closed, people recommend using a context manager:
with open("my_file.txt") as f:
    for line in f:
That should at least get you going in the right direction.
If you want to search case-insensitively you can simply use str.lower():
term.lower() in search_string.lower()  # case insensitive
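Putting those pieces together, a minimal sketch of the search (the filename is a placeholder):

def find_quotes(filename, term):
    """Print every line (quote plus author) containing the search term."""
    with open(filename) as f:
        for line in f:
            if term.lower() in line.lower():  # case-insensitive match
                print(line.strip())

find_quotes("quotes.txt", "Gentlemen")  # hypothetical filename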
Python strings have a split() method:
individual_words = 'This is my sentence.'.split()
print(len(individual_words)) # 4
Edit: As @ShadowRanger mentions below, calling split() without an argument takes care of leading, trailing, and consecutive whitespace.
