I am trying to find a way to parse a string of a transcript into speaker segments (as a list). Speaker labels are denoted by the upper-casing of the speaker's name followed by a colon. The problem I am having is some names have a number of non upper-case characters. Examples might include the following:
OBAMA: said something
O'MALLEY: said something else
GOV. HICKENLOOPER: said something else entirely'
I have written the following regex, but I am struggling to get it to work:
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)
What I think I have written (and ideally what I want to do) is a command to split the string based on:
1. Find a newline
2. Use positive look-ahead for one or more uppercase characters
3. If upper-case characters are found look for optional characters from the list of periods, apostrophes, single spaces, and digits
4. If these optional characters are found, look for additional uppercase characters.
5. Crucially, find a colon symbol at the end of this sequence.
EDIT: In many cases, the content of the speech will have newline characters contained within it, and possibly colon symbols. As such, the only thing separating the speaker label from the content of speech is the sequence mentioned above.
just change (\ |.|\'|\d) to [\ .\'\d] or (?:\ |.|\'|\d)
import re
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+[\ \.\'\d]*[A-Z]*:)', mystring)
print(parse_turns)
If it's true that the speaker's name and what they said are separated by a colon, then it might be simpler to move away from regex to do your splitting.
list_of_things = []
mystring = "OBAMA: Hi\nO'MALLEY: True Dat\nHUCK FINN: Sure thing\n"
lines = mystring.split("\n")# 1st split the string into lines based on the \n character
for line in lines:
colon_pos = line.find(":",0) # Finds the position of the first colon in the line
speaker, utterance = line[0:colon_pos].strip(), line[colon_pos+1:].strip()
list_of_things.append((speaker, utterance))
At the end, you should have a neat list of tuples containing speakers, and the things they said.
Related
I'm working on an assignment for class (in jupyter) where I need to write a regex statement in Python that will return all lines (the document is already split apart) with a word completely capitalized at the end.
For example, if I have the following sentences:
I like APPLES.
I like oranges.
I like POTATOES.
The first and third sentence should come back to me. Here's what I've tried:
my_regex = r".*[^a-z][A-Z]$"
[line for line in poemlines if re.search(my_regex, line)]
and
my_regex = r"\b[^a-z][A-Z]{0,10}$"
[line for line in poemlines if re.search(my_regex, line)]
You tried:
r".*[^a-z][A-Z]$"
This matches anything .*, followed by a character that's not a single lowercase letter [^a-z], followed by a character that's a single uppercase letter [A-Z] and then the end of the string $.
And you tried:
r"\b[^a-z][A-Z]{0,10}$"
This matches a word boundary \b, followed by a character that's not a single lowercase letter [^a-z], followed by 0-10 uppercase letters [A-Z]{0,10} and then the end of the string $.
I find that often programmers new to regexes are confused by them and end up just trying a bunch of stuff - it really helps to understand the meaning of every character in your regex and online regex tools can help with that, as they tend of offer a full explanation (like regex101.com)
Something like this makes sense:
import re
poemlines = ['I like APPLES.', 'I like oranges.', 'I like POTATOES.']
my_regex = r'\b[A-Z]+\.$'
print([line for line in poemlines if re.search(my_regex, line, re.MULTILINE)])
This works because it matches a word boundary \b, followed by an uppercase letter one or more times [A-Z]+, followed by a literal period \. and then the end of the string $.
See the explanation here https://regex101.com/r/XstTcc/1 and play around with it. Note that it also works without the , re.MULTILINE, but this is because each line is its own string. In a situation where you'd be matching in a text with multiple lines, you'd want that there.
I'm trying to replace a string element, but only if it doesn't have additional characters after the match, though the characters before the match can vary... For example, if I tokenize a name containing underscores, and I want to replace anything that ends with "R", but not elements that start with it... so it would replace "R", or "SideR", but not "Rear" because there are characters that follow after "R". I remember someone showing me something like this before, but can't find it. It was something akin to \n (but wasn't \n, which is a new line, there is no new line), but could be put at the end of a string to denote no further characters (There was ether one for the start... may have been the same thing for start or end).
test="New_R_SideR_Rear_Object"
tokens=test.split("_")
newtest=""
for each in tokens:
if "R" in each:
each=each.replace("R", "L")
newtest=(newtest+each+"_")
I'm positive there is something I can add to the end of the "if "R" in each" line, or the .replace line, that will allow me to ensure that "Rear" doesn't become "Lear", but both "R" and "SideR" doe get replaced.
The above is just simplified for ease of explanation. Thanks for your time
You can use a regular expression. The regular expression language provides a compact way to express how to match text. For your example:
$ python3
>>> import re
>>> test="New_R_SideR_Rear_Object"
>>> re.sub(r"R(_|\b)", r"L\1", test)
'New_L_SideL_Rear_Object'
>>>
I am currently having trouble removing the end of strings using regex. I have tried using .partition with unsuccessful results. I am now trying to use regex unsuccessfully. All the strings follow the format of some random words **X*.* Some more words. Where * is a digit and X is a literal X. For Example 21X2.5. Everything after this dynamic string should be removed. I am trying to use re.sub('\d\d\X\d.\d', string). Can someone point me in the right direction with regex and how to split the string?
The expected output should read:
some random words 21X2.5
Thanks!
Use following regex:
re.search("(.*?\d\dX\d\.\d)", "some random words 21X2.5 Some more words").groups()[0]
Output:
'some random words 21X2.5'
Your regex is not correct. The biggest problem is that you need to escape the period. Otherwise, the regex treats the period as a match to any character. To match just that pattern, you can use something like:
re.findall('[\d]{2}X\d\.\d', 'asb12X4.4abc')
[\d]{2} matches a sequence of two integers, X matches the literal X, \d matches a single integer, \. matches the literal ., and \d matches the final integer.
This will match and return only 12X4.4.
It sounds like you instead want to remove everything after the matched expression. To get your desired output, you can do something like:
re.split('(.*?[\d]{2}X\d\.\d)', 'some random words 21X2.5 Some more words')[1]
which will return some random words 21X2.5. This expression pulls everything before and including the matched regex and returns it, discarding the end.
Let me know if this works.
To remove everything after the pattern, i.e do exactly as you say...:
s = re.sub(r'(\d\dX\d\.\d).*', r'\1', s)
Of course, if you mean something else than what you said, something different will be needed! E.g if you want to also remove the pattern itself, not just (as you said) what's after it:
s = re.sub(r'\d\dX\d\.\d.*', r'', s)
and so forth, depending on what, exactly, are your specs!-)
I want to write a regex to check if a word ends in anything except s,x,y,z,ch,sh or a vowel, followed by an s. Here's my failed attempt:
re.match(r".*[^ s|x|y|z|ch|sh|a|e|i|o|u]s",s)
What is the correct way to complement a group of characters?
Non-regex solution using str.endswith:
>>> from itertools import product
>>> tup = tuple(''.join(x) for x in product(('s','x','y','z','ch','sh'), 's'))
>>> 'foochf'.endswith(tup)
False
>>> 'foochs'.endswith(tup)
True
[^ s|x|y|z|ch|sh|a|e|i|o|u]
This is an inverted character class. Character classes match single characters, so in your case, it will match any character, except one of these: acehiosuxyz |. Note that it will not respect compound groups like ch and sh and the | are actually interpreted as pipe characters which just appear multiple time in the character class (where duplicates are just ignored).
So this is actually equivalent to the following character class:
[^acehiosuxyz |]
Instead, you will have to use a negative look behind to make sure that a trailing s is not preceded by any of the character sequences:
.*(?<!.[ sxyzaeiou]|ch|sh)s
This one has the problem that it will not be able to match two character words, as, to be able to use look behinds, the look behind needs to have a fixed size. And to include both the single characters and the two-character groups in the look behind, I had to add another character to the single character matches. You can however use two separate look behinds instead:
.*(?<![ sxyzaeiou])(?<!ch|sh)s
As LarsH mentioned in the comments, if you really want to match words that end with this, you should add some kind of boundary at the end of the expression. If you want to match the end of the string/line, you should add a $, and otherwise you should at least add a word boundary \b to make sure that the word actually ends there.
It looks like you need a negative lookbehind here:
import re
rx = r'(?<![sxyzaeiou])(?<!ch|sh)s$'
print re.search(rx, 'bots') # ok
print re.search(rx, 'boxs') # None
Note that re doesn't support variable-width LBs, therefore you need two of them.
How about
re.search("([^sxyzaeiouh]|[^cs]h)s$", s)
Using search() instead of match() means the match doesn't have to begin at the beginning of the string, so we can eliminate the .*.
This is assuming that the end of the word is the end of the string; i.e. we don't have to check for a word boundary.
It also assumes that you don't need to match the "word" hs, even it conforms literally to your rules. If you want to match that as well, you could add another alternative:
re.search("([^sxyzaeiouh]|[^cs]|^h)s$", s)
But again, we're assuming that the beginning of the word is the beginning of the string.
Note that the raw string notation, r"...", is unecessary here (but harmless). It only helps when you have backslashes in the regexp, so that you don't have to escape them in the string notation.
I'm trying to capture all the remaining text in a file after three hyphens at the start of a line (---).
Example:
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Everything after the first set of three hyphens should be captured. The closest I've gotten is using this regex [^(---)]+$ which works slightly. It will capture everything after the hyphens, but if the user places any hyphens after that point it instead then captures after the last hyphen the user placed.
I am using this in combination with python to capture text.
If anyone can help me sort out this regex problem I'd appreciate it.
pat = re.compile(r'(?ms)^---(.*)\Z')
The (?ms) adds the MULTILINE and DOTALL flags.
The MULTILINE flag makes ^ match the beginning of lines (not just the beginning of the string.) We need this because the --- occurs at the beginning of a line, but not necessarily the beginning of the string.
The DOTALL flag makes . match any character, including newlines. We need this so that (.*) can match more than one line.
\Z matches the end of the string (as opposed to the end of a line).
For example,
import re
text = '''\
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
'''
pat = re.compile(r'(?ms)^---(.*)\Z')
print(re.search(pat, text).group(1))
prints
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Note that when you define a regex character class with brackets, [...], the stuff inside the brackets are (in general, except for hyphenated ranges like a-z) interpreted as single characters. They are not patterns. So [---] is not different than [-]. In fact, [---] is the range of characters from - to -, inclusive.
The parenthese inside the character class are interpreted as literal parentheses too, not grouping delimiters. So [(---)] is equivalent to [-()], the character class including the hyphen and left and right parentheses.
Thus the character class [^(---)]+ matches any character other than the hyphen or parentheses:
In [23]: re.search('[^(---)]+', 'foo - bar').group()
Out[23]: 'foo '
In [24]: re.search('[^(---)]+', 'foo ( bar').group()
Out[24]: 'foo '
You can see where this is going, and why it does not work for your problem.
Sorry for not directly answering your question, but I wonder if regular expressions are overcomplicating the problem? You could do something like this:
f = open('myfile', 'r')
for i in f:
if i[:3] == "---":
break
text = f.readlines()
f.close()
Or, am I missing something?
I tend to find that regular expressions are difficult enough to maintain that if you don't need their unique capabilities for a given purpose it'll be cleaner and more readable to avoid using them entirely.
s = open(myfile).read().split('\n\n---\n\n', 1)
print s[0] # first part
print s[1] # second part after the dashes
This should work for your example. The second parameter to split specifies how many times to split the string.