Match names, dialogues, and actions from transcript using regex

Given a string dialogue such as below, I need to find the sentence that corresponds to each user.
text = '''CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''
For the above dialogue, I would like to return tuples with three elements with:
The name of the person
The sentence in lower case and
The sentences within Brackets
Something like this:
('CHRIS', 'Hello, how are you...', None)
('PETER', 'Great, you?', None)
('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')
('CHRIS', 'Are you ok?', None)
etc...
I am trying to use regex to achieve the above. So far I was able to get the names of the users with the below code. I am struggling to identify the sentence between two users.
actors = re.findall(r'\w+(?=\s*:[^/])',text)

You can do this with re.findall:
>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
('PETER', ' Great, you? ', ''),
('PAM',
' He is resting.',
'[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
('CHRIS', ' Are you ok?', '')]
Removing the square brackets is left to you as a separate post-processing step; it cannot be done in the same regex while still matching everything.
Regex Breakdown
\b                # Word boundary
(\S+)             # First capture group: a run of non-whitespace characters
:                 # Colon
(                 # Second capture group
    [^            # Match anything that is not...
        :         # a colon
        \[\]      # or a square bracket
    ]+?           # Non-greedy match
)
\n?               # Optional newline
(                 # Third capture group
    \[            # Literal opening bracket
    [^:]+?        # Similar to above - exclude colon from match
    \]            # Literal closing bracket
    \n?           # Optional newline
)?                # Third capture group is optional
(?=               # Lookahead for...
    \b            # a word boundary, followed by
    \S+           # one or more non-whitespace chars, and
    :             # a colon
    |             # Or,
    $             # end of the string
)
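The bracket removal can be handled in a second pass over the findall result. The following is a minimal sketch, not part of the original answer: the helper name clean is hypothetical, and it assumes every action is wrapped in one pair of square brackets.

```python
import re

text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you? PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

# The pattern from the answer above, unchanged.
pattern = r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)'


def clean(match):
    """Hypothetical helper: tidy one (name, speech, actions) tuple."""
    name, speech, actions = match
    # Pull the text out of every [...] block in the third group.
    found = re.findall(r'\[([^\]]*)\]', actions)
    return (name, speech.strip(), '. '.join(found) if found else None)


result = [clean(m) for m in re.findall(pattern, text)]
for row in result:
    print(row)
```

Keeping the clean-up separate leaves the main pattern as it is and makes the None fallback for lines without actions explicit.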

Regex is one way to approach this problem, but you can also think about it as iterating through each token in your text and applying some logic to form groups.
For example, we could first find groups of names and text:
from itertools import groupby
def isName(word):
    # Names end with ':'
    return word.endswith(":")
text_split = [
    " ".join(list(g)).rstrip(":")
    for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']
Next you can collect pairs of consecutive elements in text_split into tuples:
print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]
We're almost at the desired output. We just need to deal with the text in the square brackets. You can write a simple function for that. (Regular expressions are admittedly an option here, but I'm purposely avoiding them in this answer.)
Here's something quick that I came up with:
def isClosingBracket(word):
    return word.endswith("]")
def processWords(words):
    if "[" not in words:
        return [words, None]
    else:
        return [
            " ".join(g).replace("]", ".")
            for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
        ]
print(
    [(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]
Note that using * to unpack the result of processWords inside a tuple display requires Python 3.5 or later.

Related

REGEX_String between strings in a list

From this list:
['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
I would like to reduce it to this list:
['BELMONT PARK', 'EAGLE FARM']
You can see from the first list that the desired words are between '\n' and '('.
My attempted solution is:
for i in x:
    result = re.search('\n(.*)(', i)
    print(result.group(1))
This returns the error 'unterminated subpattern'.
Thank you

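You’re getting the error because the ( is unescaped. Escaping it alone is not enough, though: group 1 would still contain the space before the (, and if the \n in your data is a literal backslash followed by n rather than a real newline, .* runs across it and you’ll get the following matches:
\nBELMONT PARK (
\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (
You can try the following (written for literal \n text; for real newlines in a Python raw string, use \n in place of \\n):
(?<=\\n)(?!.*\\n)(.*)(?= \()
(?<=\\n): Positive lookbehind to ensure \n is before the match
(?!.*\\n): Negative lookahead to ensure no further \n follows on the way to the match
(.*): Your match
(?= \(): Positive lookahead to ensure a space and ( come after the match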
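For the Python list in the question, where \n is a real newline character, the same lookaround idea can be sketched as follows. This is an illustrative adaptation under that assumption, not the answerer's exact pattern; the variable name names is hypothetical.

```python
import re

x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']

# Same lookarounds as above, but \n here means a real newline character.
pattern = re.compile(r'(?<=\n)(?!.*\n)(.*)(?= \()')

names = []
for item in x:
    m = pattern.search(item)
    if m:  # skip items with no "newline ... (" section
        names.append(m.group(1))

print(names)
```

Because . does not match a newline by default, the negative lookahead only rejects lines that are followed by another newline, so the match always comes from the last line of each item.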

You can get the matches without using any lookarounds, as you are already using a capture group.
\n(.*) \(
Explanation
\n      Match a newline
(.*)    Capture group 1: match any character except a newline, as many times as possible
 \(     Match a space, then a literal (
Example
import re
x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
pattern = r"\n(.*) \("
for i in x:
    m = re.search(pattern, i)
    if m:
        print(m.group(1))
Output
BELMONT PARK
EAGLE FARM
If you want to return a list:
x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
pattern = r"\n(.*) \("
res = [m.group(1) for i in x for m in [re.search(pattern, i)] if m]
print(res)
Output
['BELMONT PARK', 'EAGLE FARM']

Replacing acronyms with their full forms in Python

I have an acronym dictionary that has keys as an acronym and values as full forms.
I want to replace the acronyms found in text_list with their full forms to arrive at output_list:
acronym_dict = {
'QUO': 'Quotation',
'IN': 'India',
'SW': 'Software',
'RE': 'Regular Expression'
}
text_list = [
'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
'The update does not belong to the SW, version, branch',
'This is a RE_Text'
]
output_list = [
'The status Quotation has changed',
'I SWEAR, This is part of India_Software',
'The update does not belong to the Software, version, branch',
'This is a Regular Expression_Text'
]
I wrote a method to do that
import string
def remove_punctuations(text):
    punct_str = string.punctuation  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    for punctuation in punct_str:
        text = text.replace(punctuation, ' ')
    return text.strip()
def replace_single_acronym(text, acronym, fullform):
    words = text.split()
    return_words = []
    for w in words:
        if remove_punctuations(w).lower() == acronym.lower():
            return_words.append(w.replace(acronym, fullform))
        else:
            return_words.append(w)
    return " ".join(return_words)
my_op_list = []
for text in text_list:
    for acronym in acronym_dict.keys():
        text = replace_single_acronym(text, acronym, acronym_dict[acronym])
    my_op_list.append(text)
Ideally output_list and my_op_list should look the same. It prints the below result (failing in 2 instances)
['The status Quotation has changed',
'I SWEAR, This is part of IN_SW',
'The update does not belong to the Software, version, branch',
'This is a RE_Text']
Also, the method replace_single_acronym is very slow on a corpus of 1000 text_list items.
Can someone help me in adjusting the method to do it in the right and efficient way?

You can use re.sub for this task by passing a function as its second argument, in the following way:
import re
acronym_dict = {
'QUO': 'Quotation',
'IN': 'India',
'SW': 'Software',
'RE': 'Regular Expression'
}
text_list = [
'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
'The update does not belong to the SW, version, branch',
'This is a RE_Text'
]
def get_full_name(m):
    return acronym_dict.get(m.group(1), m.group(1))
def replace_acronyms(text):
    return re.sub(r'(?<![A-Z])([A-Z]+)(?![A-Z])', get_full_name, text)
output_list = [replace_acronyms(i) for i in text_list]
print(output_list)
output:
['The status Quotation has changed', 'I SWEAR, This is part of India_Software', 'The update does not belong to the Software, version, branch', 'This is a Regular Expression_Text']
Explanation: the pattern contains two zero-length assertions and one capturing group. It finds one or more uppercase ASCII letters that are not preceded by an uppercase ASCII letter (negative lookbehind) and not followed by one (negative lookahead). get_full_name is passed as the second argument of re.sub, so it accepts a single argument, the match object. m.group(1) is the content of the sole capturing group, i.e. the candidate acronym. I used the dict's .get method so that if the acronym is present among the keys the corresponding value is used, and otherwise the acronym is returned unchanged.
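A variation on the same re.sub technique is to build the alternation directly from the dictionary keys, so that only known acronyms ever reach the replacement function. This is a sketch under the assumption that an acronym counts only when it is not adjacent to another ASCII letter; the use of re.escape and the length-based sorting are additions for illustration, not part of the answer above.

```python
import re

acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression',
}

text_list = [
    'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text',
]

# Longest keys first, so a longer acronym wins over a shorter prefix.
keys = sorted(acronym_dict, key=len, reverse=True)
pattern = re.compile(
    r'(?<![A-Za-z])(?:' + '|'.join(map(re.escape, keys)) + r')(?![A-Za-z])'
)


def replace_acronyms(text):
    # Every match is a dictionary key, so a plain lookup is enough.
    return pattern.sub(lambda m: acronym_dict[m.group(0)], text)


output_list = [replace_acronyms(t) for t in text_list]
print(output_list)
```

Compiling the pattern once and making a single pass per string avoids the nested loop over every acronym for every word.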

regex match word and what comes after it

I need some help with a regex I am writing. I have a list of words that I want to match, together with any words that might come after them (words meaning [A-Za-z/\s]+), i.e. no parentheses, symbols, or numbers.
words = ['qtr','hard','quarter'] # keywords that must exist
test=['id:12345 cli hard/qtr Mix',
'id:12345 cli qtr 90%',
'id:12345 cli hard (red)',
'id:12345 cli hard work','Hello world']
The expected output is
['hard/qtr Mix', 'qtr', 'hard', 'hard work', None]
What I have tried so far
re.search(r'((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))',x,re.I)

The problem with your pattern, '((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))', is the doubled square brackets: [[A-Za-z/\s]+] is parsed as the character class [[A-Za-z/\s] (which also contains a literal [) repeated one or more times, followed by a literal ], so a match requires a ] in the text. (\s inside a character class still matches whitespace.) Use a single pair of brackets instead; if you only want to allow spaces, you can put a plain space character in the class.
You can join all the words in the words list with | to create the pattern '((qtr|hard|quarter)([a-zA-Z/ ]*))', then search for the pattern in each string in the list. If a match is found, take group 0 and append it to the resulting list; otherwise, append None:
pattern = re.compile('(('+'|'.join(words)+')([a-zA-Z/ ]*))')
result = []
for x in test:
    groups = pattern.search(x)
    if groups:
        result.append(groups.group(0))
    else:
        result.append(None)
OUTPUT:
result
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
Since the character class includes the space character, some values may end with a space; you can strip the trailing whitespace afterwards.
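The stripping step can be folded into the loop. The sketch below is one possible way to do it; re.escape and the re.I flag (taken from the question's own attempt) are additions for illustration rather than part of the answer above.

```python
import re

words = ['qtr', 'hard', 'quarter']
test = ['id:12345 cli hard/qtr Mix',
        'id:12345 cli qtr 90%',
        'id:12345 cli hard (red)',
        'id:12345 cli hard work', 'Hello world']

# One keyword, then any run of letters, slashes and spaces.
pattern = re.compile(
    '(?:' + '|'.join(map(re.escape, words)) + ')[A-Za-z/ ]*', re.I
)

result = []
for x in test:
    m = pattern.search(x)
    # Strip the trailing space the character class may have swallowed.
    result.append(m.group(0).strip() if m else None)

print(result)
```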

The same idea as the existing answer, made shorter:
>>> pattern = re.compile('(('+'|'.join(words)+')([a-zA-Z/ ]*))')
>>> [pattern.search(x).group(0) if pattern.search(x) else None for x in test]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
As mentioned in a comment:
But it is quite inefficient, because it searches for the same pattern twice, once for pattern.search(x).group(0) and once for if pattern.search(x), and a list comprehension is not the best tool in such scenarios.
We can avoid the double search like this:
>>> [v.group(0) if v else None for v in (pattern.search(x) for x in test)]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]

You can join all the required words into an alternation (an "or" expression) and put your word definition after it:
import re
words = ['qtr','hard','quarter']
regex = r"(" + "|".join(words) + r")[A-Za-z/\s]+"
p = re.compile(regex)
test=['id:12345 cli hard/qtr Mix(qtr',
'id:12345 cli qtr 90%',
'id:12345 cli hard (red)',
'id:12345 cli hard work','Hello world']
for string in test:
    result = p.search(string)
    if result is not None:
        print(result.group(0))
    else:
        print(result)
Output:
hard/qtr Mix
qtr
hard
hard work
None

regular expression for the extracting multiple patterns

I have a string like this
string="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
tokens=re.findall('(.*)\r\n(.*?:)(.*?])',string)
Output
('Claim Status', '[Primary Status:', ' Paidup to Rebilled]')
('General Info.', '[PA Number:', ' R180126187]')
('Claim Insurance: Modified', '[Ins. Mode:', ' Primary]')
Wanted output:
('Claim Status', 'Primary Status:Paidup to Rebilled')
('General Info.', 'PA Number:R180126187')
('Claim Insurance: Modified', 'Ins. Mode:Primary','ICN: ########', 'Id: #########')

You may achieve what you need with a solution like this:
import re
s="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
res = []
for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', s, re.M):
    t = []
    t.append(m.group(1).strip())
    if m.group(2):
        t.extend([x.strip() for x in m.group(2).strip().split('], [') if ':' in x])
    res.append(tuple(t))
print(res)
Output:
[('Claim Status', 'Primary Status: Paidup to Rebilled'), ('General Info.', 'PA Number: #######'), ('Claim Insurance: Modified', 'Ins. Mode: Primary', 'ICN: #######', 'Id: ########')]
With the ^(.+)(?:\r?\n\s*\[(.+)])?\r?$ regex, you match two consecutive lines, the second being optional (due to the (?:...)? optional non-capturing group). The first line is captured into Group 1 and the following one (which starts with [ and ends with ]) into Group 2. (Note that \r?$ is necessary because in multiline mode $ only matches before a newline, not before a carriage return.) The Group 1 value is added to a temporary list; the contents of Group 2 are then split on ], [ (if you are not sure about the amount of whitespace, you may use re.split(r']\s*,\s*\[', m.group(2))), and only the items that contain a : are added to the temporary list.
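The whitespace-tolerant split mentioned above can be sketched in isolation. The sample string group2 is made up for illustration (it stands in for a Group 2 value with irregular spacing) and does not come from the question's data.

```python
import re

# Hypothetical Group 2 content with inconsistent spacing around the commas.
group2 = "Ins. Mode: Primary],[Corrected Claim Checked] ,  [ICN: 123], [Id: 456"

# Split on "]" + optional spaces + "," + optional spaces + "[".
parts = re.split(r'\]\s*,\s*\[', group2)

# Keep only the key: value items, as in the answer above.
items = [p.strip() for p in parts if ':' in p]

print(items)
```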

You are getting three elements per result because you are using "capturing" regular expressions. Rewrite your regexp like this to combine the second and third match:
re.findall('(.*)\r\n((?:.*?:)(?:.*?]))',string)
A group delimited by (?:...) (instead of (...)) is "non-capturing", i.e. it doesn't count as a match target for \1 etc., and it does not get "seen" by re.findall. I have made both your groups non-capturing, and added a single capturing (regular) group around them.
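The effect of capturing versus non-capturing groups on re.findall can be seen on a small example. The sample string below is made up for illustration.

```python
import re

sample = "key: value]"

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'(.*?:)(.*?])', sample)

# One capturing group wrapped around two non-capturing ones:
# findall returns plain strings.
one_group = re.findall(r'((?:.*?:)(?:.*?]))', sample)

print(two_groups)
print(one_group)
```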

Python regex: Match ALL consecutive capitalized words

Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?

You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print(re.findall(pattern, title))
A "normal" pattern can't match overlapping substrings, because each character is consumed only once. However, a lookahead (?=..) (i.e. "followed by") is only a check and consumes nothing, so it can be tested at every position in the string. Thus, if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
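As a quick check of the lookahead technique, here is the same pattern run on the title from the question (without the extra TEst word), written so that it works under Python 3. The variable name pairs is chosen for illustration.

```python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"

# The whole pair sits in a capturing group inside a lookahead, so the
# regex engine consumes nothing and retries at every position.
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'

pairs = re.findall(pattern, title)
print(pairs)
```

The negative lookbehind (?<![A-Za-z.]) keeps a pair from starting in the middle of a word.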

There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate through the matches, testing each adjacent pair with this regex: (^[A-Z][a-z.-]+$), to ensure that both the previous match and the current one are properly capitalized.
Working example:
import re
title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
if m:
    # Start at 1 so that m[i - 1] never wraps around to the last element
    for i in range(1, len(m)):
        if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
            matchlist.append([m[i - 1], m[i]])
print(matchlist)
Output:
[
['Announcing', 'Elasticsearch.js'],
['Elasticsearch.js', 'For'],
['For', 'Node.js'],
['Node.js', 'And'],
['And', 'The'],
['The', 'Browser']
]

If your Python code at the moment is this
title="Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping every other pair, because findall returns only non-overlapping matches. An easy solution is to run the search again after skipping the first word, like this (the first word's character class is widened to [a-z.] so that words such as Elasticsearch.js can also start a pair):
m = re.match(r"[A-Z][a-z.]+[\s-]", title)
title_without_first_word = title[m.end():]
results2 = re.findall(r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and results2 together.
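One way to combine the two passes is to interleave them, since the first pass yields the 1st, 3rd, 5th... pairs and the second pass the 2nd, 4th, 6th... The sketch below assumes the first word of each pair may contain a dot (character class [a-z.]), which the title in the question needs; the names pair and combined are chosen for illustration.

```python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"
pair = r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*"

# First pass: pairs starting at the 1st, 3rd, 5th... word.
results = re.findall(pair, title)

# Second pass: skip the first word, then search again.
m = re.match(r"[A-Z][a-z.]+[\s-]", title)
results2 = re.findall(pair, title[m.end():]) if m else []

# Interleave the two lists to restore the original order.
combined = []
for i in range(max(len(results), len(results2))):
    if i < len(results):
        combined.append(results[i])
    if i < len(results2):
        combined.append(results2[i])

print(combined)
```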
