I have this:
myText = str(^123"I like to"^456&U"play video games and"$"eat cereal")
I want to extract everything in between (and including) quotation marks, split everything before and after the $ sign, and append them into a nested list. E.g.
myTextList = [["I like to","play video games and"],["eat cereal"]]
This is what I tried:
tempTextList = []
for text in re.findall(r'(?<=\$)"[^"]*"(?<!\^)', myText, re.DOTALL):
    tempTextList.append(text)
myTextList.append(tempTextList)
I used the website https://www.regex101.com/#python and tried almost everything I could think of...
(?!\$)"(?!\^\00\+\-\&)[^"].*"
etc...
The re.findall part doesn't really work the way I want it to.
Can someone point me in the right direction?
Thanks
You can use "[^"]*" regex with re.findall:
import re
s = 'myText = str(^123"I like to"^456&U"play video games and"$"eat cereal")'
print(re.findall(r'"[^"]*"', s))
It matches the double quoted substrings you need with double quotes: ['"I like to"', '"play video games and"', '"eat cereal"'].
Note that "[^"]*" matches a ", followed by zero or more characters other than ", followed by a closing ".
If you need to get the contents inside "..." without the double quotes, you can use capturing mechanism:
r'"([^"]*)"'
re.findall will then return only the Group 1 captures, i.e. the text without the surrounding quotes.
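The nested list from the question can be built by splitting on $ first and then running the same pattern on each part. This is only a minimal sketch: it assumes $ never occurs inside a quoted substring, and the names raw, parts and nested are made up for illustration.

```python
import re

# Input from the question, stored as a plain string
raw = '^123"I like to"^456&U"play video games and"$"eat cereal"'

# Split on the $ separator first, then collect the quoted substrings of each part
parts = raw.split('$')
nested = [re.findall(r'"[^"]*"', part) for part in parts]

print(nested)
```

Swapping in r'"([^"]*)"' would give the same nesting without the quotation marks.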
I am given a string which is a number, for example "44.87" or "44.8796". I want to extract everything after the decimal point (.). I tried to use a regex in Python but was not successful. I am new to Python 3.
import re
s = "44.123"
re.findall(".","44.86")
Something like s.split('.')[1] should work
If you would like to use regex try:
import re
s = "44.123"
regex_pattern = r"(?<=\.).*"  # raw string, so \. reaches the regex engine unchanged
matched_string = re.findall(regex_pattern, s)
(?<=...) is a positive lookbehind: it requires the given text immediately before the match, without including it in the result
\. is an escaped (literal) period
.* means "match everything after the period, up to the end of the line"
This online regex tool is a helpful way to test your regex as you build it. You can confirm this solution there! :)
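Both suggestions above can be wrapped so that a string without a period does not raise an error. A small sketch, assuming the input is a plain decimal string; the function names fraction_digits and fraction_digits_re are hypothetical.

```python
import re

def fraction_digits(number_text):
    """Return the text after the first '.', or '' if there is no '.'."""
    # partition never raises, unlike split('.')[1] on a string without a period
    _, _, after = number_text.partition('.')
    return after

def fraction_digits_re(number_text):
    """Same idea with a regex: capture the digits that follow a literal period."""
    match = re.search(r'\.(\d+)', number_text)
    return match.group(1) if match else ''

print(fraction_digits("44.8796"))
print(fraction_digits_re("44.87"))
```

The string-method version is probably the easier one to read if the input is always a simple number.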
I am trying to find a way to parse a string of a transcript into speaker segments (as a list). Speaker labels are denoted by the upper-casing of the speaker's name followed by a colon. The problem I am having is some names have a number of non upper-case characters. Examples might include the following:
OBAMA: said something
O'MALLEY: said something else
GOV. HICKENLOOPER: said something else entirely
I have written the following regex, but I am struggling to get it to work:
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)
What I think I have written (and ideally what I want to do) is a command to split the string based on:
1. Find a newline
2. Use positive look-ahead for one or more uppercase characters
3. If upper-case characters are found look for optional characters from the list of periods, apostrophes, single spaces, and digits
4. If these optional characters are found, look for additional uppercase characters.
5. Crucially, find a colon symbol at the end of this sequence.
EDIT: In many cases, the content of the speech will have newline characters contained within it, and possibly colon symbols. As such, the only thing separating the speaker label from the content of speech is the sequence mentioned above.
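Just change (\ |\.|\'|\d) to [\ \.\'\d] or (?:\ |\.|\'|\d). re.split also returns the text of every capturing group in the pattern, so a capturing group puts extra items into the result; a character class or a non-capturing group avoids that.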
import re
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+[\ \.\'\d]*[A-Z]*:)', mystring)
print(parse_turns)
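To see why the capturing group matters, the two patterns can be compared side by side. This is just an illustrative sketch; the variable names with_group and with_class are not from the question.

```python
import re

mystring = ("OBAMA: said something \nO'MALLEY: said something else \n"
            "GOV. HICKENLOOPER: said something else entirely")

# Original pattern: re.split adds the capturing group's text to the result list
with_group = re.split(r"\n(?=[A-Z]+(\ |\.|'|\d)*[A-Z]*:)", mystring)

# Fixed pattern: a character class does not capture anything
with_class = re.split(r"\n(?=[A-Z]+[ .'\d]*[A-Z]*:)", mystring)

print(len(with_group), len(with_class))
print(with_class)
```

The first list is longer than the second because each split point contributes one extra item for the group.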
If it's true that the speaker's name and what they said are separated by a colon, then it might be simpler to move away from regex to do your splitting.
list_of_things = []
mystring = "OBAMA: Hi\nO'MALLEY: True Dat\nHUCK FINN: Sure thing\n"
lines = mystring.split("\n")  # first split the string into lines on the \n character
for line in lines:
    if not line.strip():
        continue  # skip blank lines, such as the one after the trailing \n
    colon_pos = line.find(":")  # position of the first colon in the line
    speaker, utterance = line[:colon_pos].strip(), line[colon_pos + 1:].strip()
    list_of_things.append((speaker, utterance))
At the end, you should have a neat list of tuples containing speakers, and the things they said.
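If the speech itself can contain newlines and colons, as the EDIT in the question says, splitting on every line will break turns apart. One possible way around that, sketched here with a made-up transcript, is to split only where a speaker label follows and then cut each turn at its first colon:

```python
import re

transcript = (
    "OBAMA: Hi there.\nThis is a second line: with a colon.\n"
    "O'MALLEY: True\nGOV. HICKENLOOPER: Sure thing\n"
)

# Split only at newlines that are followed by an upper-case speaker label
label_ahead = r"\n(?=[A-Z]+[ .'\d]*[A-Z]*:)"
turns = re.split(label_ahead, transcript.strip())

# partition cuts at the first colon only, so later colons stay in the speech
pairs = []
for turn in turns:
    speaker, _, utterance = turn.partition(":")
    pairs.append((speaker.strip(), utterance.strip()))

print(pairs)
```

This still assumes that no line of speech starts with something that looks like a label, for example a line beginning with "A:".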
I have a long string like this:
Page Content
Director, Research Center.
Director of Research, Professor
Researcher
Lines end in a double newline. Some end with a period, some don't. I want each line that ended in a double newline to end with a single period and a single newline, like this:
Page Content.
Director, Research Center.
Director of Research, Professor.
Researcher.
There are also lines which end with a period and a single newline and they should stay the way they are. What I've tried:
re.sub('(?!\.)\n\n', '.\n', text)
What I'm trying to do is a negative on the period followed by two newlines, or find every single double new line that doesn't have a period right before and replace it with a period and a single newline.
I've tried some other variations, but I always end up with either double period or no changes.
You could use a negative lookbehind instead to assert what is on the left is not a dot. Escape the dot \. to match it literally.
(?<!\.)\n\n
Or to match an optional \r you could use a quantifier to repeat a non capturing group:
(?<!\.)(?:\r?\n){2}
Not very elegant, but obviously working:
text = text.replace('.\n\n', '\n\n').replace('\n\n', '.\n')
If you insist on using re.sub:
text = re.sub(r'([^.])\.?\n\n', r'\1.\n', text)
This is downright ugly, but works too.
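A somewhat shorter variant of the same idea is to let the pattern swallow an optional period together with the double newline, so the replacement never doubles it. This is a sketch on made-up sample text and assumes plain \n line endings:

```python
import re

text = (
    "Note.\nPage Content\n\nDirector, Research Center.\n\n"
    "Director of Research, Professor\n\nResearcher\n\n"
)

# \.? absorbs an existing period, so the replacement never produces ".."
fixed = re.sub(r'\.?\n\n', '.\n', text)

print(fixed)
```

The first line of the sample ends with a period and a single newline, and it is left as it is.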
I need to replace all dots in a string which are enclosed by dollar signs.
There is no nested structure so I think regular expressions are the right tool for this.
An example string looks like this:
asdf $asdf.asdf.$ $..asdf$
The regex I came up with matches the part within the dollar signs, but I want a match for each dot within the dollar signs (example):
\$([^$]*)\$
so for the example string it should yield four matches. How can I achieve that?
Since you are using Python, the easiest solution is to use your pattern to match the substrings from $ to $, and replace . with anything you want with a lambda:
import re
s = "a.sdf $asdf.asdf.$. . .$..asdf$"
r = re.compile(r'\$([^$]*)\$')
print(r.sub(lambda m: m.group().replace('.',''), s))
# => a.sdf $asdfasdf$. . .$asdf$
I think this will be awkward with a single regex, since you have to keep track of the dollar signs in some sense: only every second gap between two dollar signs counts as "enclosed", and the others are "outside", right?
So it seems easier to do the following (Python example):
1. do a mystring.split('$'), which gives you a list
2. then take every second item, e.g. newlist = oldlist[1::2]
3. count the dots with ''.join(newlist).count('.')
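Put together, the split-based steps could look like this. It is a sketch that assumes the dollar signs are balanced; the names pieces, enclosed and dot_count are only for illustration.

```python
s = "asdf $asdf.asdf.$ $..asdf$"

# Odd-indexed pieces of the split are the parts between a pair of $ signs
pieces = s.split('$')
enclosed = pieces[1::2]

# Count the dots in the enclosed parts only
dot_count = ''.join(enclosed).count('.')
print(enclosed, dot_count)
```

For the example string from the question this counts the four dots the asker expects.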
I'm looking for an OR capability to match on several strings with regular expressions.
# I would like to find either "-hex", "-mos", or "-sig"
# the result would be -hex, -mos, or -sig
# You see I want to get rid of the double quotes around these three strings.
# Other double quoting is OK.
# I'd like something like.
messWithCommandArgs = ' -o {} "-sig" "-r" "-sip" '
messWithCommandArgs = re.sub(
r'"(-[hex|mos|sig])"',
r"\1",
messWithCommandArgs)
This works:
messWithCommandArgs = re.sub(
r'"(-sig)"',
r"\1",
messWithCommandArgs)
Square brackets are for character classes that can only match a single character. If you want to match multiple character alternatives you need to use a group (parentheses instead of square brackets). Try changing your regex to the following:
r'"(-(?:hex|mos|sig))"'
Note that I used a non-capturing group (?:...) because you don't need another capture group, but r'"(-(hex|mos|sig))"' would actually work the same way since \1 would still be everything but the quotes.
Alternatively, you could use r'"-(hex|mos|sig)"' with r"-\1" as the replacement (since the - is no longer part of the group).
You should remove the [] metacharacters in order to match hex or mos or sig: (?:-(hex|mos|sig))
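The difference between the character class and the group is easy to check on the string from the question. A short sketch; the variable names args, unchanged and unquoted are made up here.

```python
import re

args = ' -o {} "-sig" "-r" "-sip" '

# Character class: matches a single character after the dash, so "-sig" is left alone
unchanged = re.sub(r'"(-[hex|mos|sig])"', r'\1', args)

# Group with alternation: matches one of the three whole words
unquoted = re.sub(r'"(-(?:hex|mos|sig))"', r'\1', args)

print(unchanged)
print(unquoted)
```

Only "-sig" loses its quotes in the second call; "-r" and "-sip" are not in the alternation and stay quoted.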