Trying to split string with regex - python

I'm trying to split a string in Python using a regex pattern, but it's not working correctly.
Example text:
"The quick {brown fox} jumped over the {lazy} dog"
Code:
"The quick {brown fox} jumped over the {lazy} dog".split(r'({.*?})')
I'm using a capture group so that the split delimiters are retained in the array.
Desired result:
['The quick', '{brown fox}', 'jumped over the', '{lazy}', 'dog']
Actual result:
['The quick {brown fox} jumped over the {lazy} dog']
As you can see there is clearly not a match as it doesn't split the string. Can anyone let me know where I'm going wrong? Thanks.

You're calling the string's split method, not re's. str.split treats its argument as a literal separator rather than a pattern, so nothing matches:
>>> re.split(r'({.*?})', "The quick {brown fox} jumped over the {lazy} dog")
['The quick ', '{brown fox}', ' jumped over the ', '{lazy}', ' dog']
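Note that the pieces returned by re.split still carry the spaces around each delimiter, while the desired result in the question has them trimmed. A minimal sketch of one way to get there, assuming empty pieces should also be dropped (the names parts and cleaned are just for illustration):

```python
import re

text = "The quick {brown fox} jumped over the {lazy} dog"

# The capture group keeps each {...} delimiter in the result list.
parts = re.split(r'({.*?})', text)

# Trim surrounding whitespace and drop pieces that end up empty.
cleaned = [p.strip() for p in parts if p.strip()]
print(cleaned)
# ['The quick', '{brown fox}', 'jumped over the', '{lazy}', 'dog']
```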

Related

Reformatting list cases of a string/word with its format from another list

The goal is to replace every occurrence in L2 of an element from L1, whatever its case, with that element exactly as it is written in L1.
For instance:
L1 = ['apple', 'some_fruit', 'BaNaNa', 'ORANGE_123']
L2 = ['The quick brown BANANA jumped over the lazy (APPLE)\n',
'Then the <orange_123> got SOME_FRUIT\n', 'The End.']
**add code**
output = ''.join(L2)
print(output)
> The quick brown BaNaNa jumped over the lazy (apple)
> Then the <ORANGE_123> got some_fruit
> The End.
In other words, every word in L2 that matches an element of L1 case-insensitively should be rewritten with the exact upper/lower casing it has in L1.
I know this is not all that straightforward, so if further explanation or more examples are needed, please let me know.
Note: I am trying to convert txt files to a new formatting, and L1 represents the correct format for specific words that I need to reformat, and L2 represents all of the lines read by .readlines from a txt file.
Try:
import re
L3 = []
for el2 in L2:
    for el1 in L1:
        # Case-insensitive match; re.escape guards against regex metacharacters in el1
        el2 = re.sub(re.escape(el1), el1, el2, flags=re.IGNORECASE)
    L3.append(el2)
Outputs:
#L2:
['The quick brown BANANA jumped over the lazy (APPLE)\n', 'Then the <orange_123> got SOME_FRUIT\n', 'The End.']
#L3:
['The quick brown BaNaNa jumped over the lazy (apple)\n', 'Then the <ORANGE_123> got some_fruit\n', 'The End.']
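The nested loop above runs one substitution per word per line. As a sketch of an alternative, all the words can be combined into a single compiled alternation and looked up in a dict. The names canonical, pattern and fix_case are hypothetical helpers, not part of the original answer:

```python
import re

L1 = ['apple', 'some_fruit', 'BaNaNa', 'ORANGE_123']
L2 = ['The quick brown BANANA jumped over the lazy (APPLE)\n',
      'Then the <orange_123> got SOME_FRUIT\n', 'The End.']

# Map each lowercased word to its canonical spelling from L1.
canonical = {word.lower(): word for word in L1}

# One alternation of all escaped words, longest first so longer words win.
pattern = re.compile(
    '|'.join(re.escape(w) for w in sorted(L1, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def fix_case(line):
    # Replace each case-insensitive hit with its canonical spelling;
    # fall back to the matched text if the lookup somehow misses.
    return pattern.sub(lambda m: canonical.get(m.group(0).lower(), m.group(0)), line)

L3 = [fix_case(line) for line in L2]
print(''.join(L3))
```

Using a function as the replacement also avoids surprises if a word in L1 ever contains a backslash, which re.sub would otherwise interpret in a replacement string.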

Regular expression for returning lines of dialogue

I am a beginner Python coder, so bear with me!
A line of complete dialog is defined as text that starts on its own line and starts and ends with double quotation marks (i.e. ").
What I have so far is:
def q_4():
    pattern = r'^\"\w*\"'
    return re.compile(pattern, re.M|re.IGNORECASE)
but for some reason it only returns one instance, with a single word between the two double quotes. How can I go about capturing full lines?
Try searching on the pattern \"[^"]+\":
import re
inp = """Here is a quote: "the quick brown fox jumps over
the lazy dog" and here is another "blah
blah blah" the end"""
# The capture group returns only the text between each pair of quotes
dialogs = re.findall(r'\"([^"]+)\"', inp)
print(dialogs)
This prints:
['the quick brown fox jumps over\nthe lazy dog', 'blah\nblah blah']
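The original pattern only matched single words because \w does not match spaces or punctuation. If "complete dialog" is read strictly, as a line that begins and ends with a double quote, the pattern can be anchored at both ends in multi-line mode. This is a sketch under that assumption; the sample text and the name complete_lines are made up for illustration:

```python
import re

text = '''"Hello there," said nobody.
"This whole line is dialogue."
He said "no" and left.
"Another complete line!"'''

# ^ and $ match at line boundaries under re.MULTILINE;
# [^"\n]* keeps the match on one line with no inner quotes.
pattern = re.compile(r'^"[^"\n]*"$', re.MULTILINE)

complete_lines = pattern.findall(text)
print(complete_lines)
# ['"This whole line is dialogue."', '"Another complete line!"']
```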

Python regex replace all patterns except when it is next to a repeated pattern

I am using Python and I have a multi-line string that looks like:
The quick brown fox jumps over the lazy dog.
The quick quick brown fox jumps over the quick lazy dog. This a very very very very long line.
This line has other text?
The quick quick brown fox jumps over the quick lazy dog.
I would like to replace all occurrences of quick with slow, with one exception: when quick is immediately followed by another quick, only the first quick is converted and the second, neighboring quick is left unchanged.
So, the output should look like this:
The slow brown fox jumps over the lazy dog.
The slow quick brown fox jumps over the slow lazy dog. This a very very very very long line.
This line has other text?
The slow quick brown fox jumps over the slow lazy dog.
I can do this using multiple passes where I first convert everything to slow and then convert the edge case during my second pass. But I'm hoping that there is a more elegant or obvious one-pass solution.
Here's a variant for regex engines that do not support lookarounds:
quick(( quick)*)
replaced by
slow\1
Here is one way, using re.sub with a negative lookbehind to replace quick only when it is not preceded by the same substring:
import re
re.sub(r'(?<!quick\s)quick', 'slow', s)
Using the shared examples:
s1 = 'The quick brown fox jumps over the lazy dog. '
s2 = 'The quick quick brown fox jumps over the quick lazy dog. This a very very very very long line.'
re.sub(r'(?<!quick\s)quick', 'slow', s1)
# 'The slow brown fox jumps over the lazy dog. '
re.sub(r'(?<!quick\s)quick', 'slow', s2)
# 'The slow quick brown fox jumps over the slow lazy dog. This a very very very very long line.'
Regex breakdown:
(?<!quick\s)quick
  Negative Lookbehind (?<!quick\s)
    quick matches the characters quick literally (case sensitive)
    \s matches any whitespace character (equal to [\r\n\t\f\v ])
  quick matches the characters quick literally (case sensitive)
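As a quick check, the same lookbehind pattern can be run over the whole multi-line string from the question in one call. The second part is an optional variant, not from the original answer, that adds \b word boundaries so that words such as quickly are left alone; the sample sentence there is made up:

```python
import re

s = """The quick brown fox jumps over the lazy dog.
The quick quick brown fox jumps over the quick lazy dog. This a very very very very long line.
This line has other text?
The quick quick brown fox jumps over the quick lazy dog."""

# One pass over the whole text; a quick right after "quick " is skipped.
result = re.sub(r'(?<!quick\s)quick', 'slow', s)
print(result)

# Variant with word boundaries so that "quickly" is not touched.
strict = re.sub(r'(?<!\bquick\s)\bquick\b', 'slow', 'a quickly quick fix')
print(strict)
# a quickly slow fix
```

Because \s also matches a newline, a quick at the start of a line that follows a line ending in quick would be treated as a neighbor too.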
You could also use grouping for this task, in the following way:
import re
txt1 = 'The quick brown fox jumps over the lazy dog.'
txt2 = 'The quick quick brown fox jumps over the quick lazy dog.'
# Group 1 is the first quick; group 2 is any run of neighboring quicks, kept via \2
out1 = re.sub(r'(quick)((\squick)*)', r'slow\2', txt1)
out2 = re.sub(r'(quick)((\squick)*)', r'slow\2', txt2)
print(out1) # The slow brown fox jumps over the lazy dog.
print(out2) # The slow quick brown fox jumps over the slow lazy dog.
The idea is pretty simple: the 1st group captures the first quick and the 2nd group captures any quicks that follow it. The whole match is then replaced with slow followed by the content of the 2nd group.

the separators are not working properly in regular expression split method

import re
text = 'The quick. black n brown? fox jumps*over the lazy dog.'
print(re.split('; |, |\? |. ',text))
This is giving me an output of:
['Th', 'quick', 'blac', '', 'brown', 'fo', 'jumps*ove', 'th', 'laz', 'dog.']
but I want that string to be split as
['The quick.', 'black n brown?', 'fox jumps*over the lazy dog.']
If I understood what you need, the dot in your regex should be escaped:
print(re.split(r'; |, |\? |\. ', text))
You can leverage a zero-width positive lookbehind here:
re.split('(?<=[;,.?]) ',text)
(?<=[;,.?]) is a zero-width positive lookbehind that matches a position preceded by any of ;, ,, . or ? literally. Only the space that follows it is consumed by the split, so the punctuation stays in the result.
Example:
In [1461]: text = 'The quick. black n brown? fox jumps*over the lazy dog.'
In [1462]: re.split(r'(?<=[;,.?]) ',text)
Out[1462]: ['The quick.', 'black n brown?', 'fox jumps*over the lazy dog.']
In your attempt, if you replace . (any character) with the escaped version \. to get a literal ., you get closer to the desired output:
In [1463]: text = 'The quick. black n brown? fox jumps*over the lazy dog.'
In [1464]: re.split(r'; |, |\? |. ',text)
Out[1464]: ['Th', 'quick', 'blac', '', 'brown', 'fo', 'jumps*ove', 'th', 'laz', 'dog.']
In [1465]: re.split(r'; |, |\? |\. ',text)
Out[1465]: ['The quick', 'black n brown', 'fox jumps*over the lazy dog.']
As all the alternatives are a single character followed by a space, you can make the pattern more compact by using a character class:
In [1466]: re.split(r'[;,?.] ',text)
Out[1466]: ['The quick', 'black n brown', 'fox jumps*over the lazy dog.']
Most regex metacharacters, such as . and ?, lose their special meaning inside a character class [], so they don't need escaping there.
Also, make regex patterns raw strings by prefixing the pattern string with r.
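To illustrate the two points above, here is a small sketch. The first split reuses the text from the question; the second uses a made-up string to show the few characters that still need care inside a character class:

```python
import re

text = 'The quick. black n brown? fox jumps*over the lazy dog.'

# Inside [...], '.' and '?' are literal, so no backslashes are needed.
without_punct = re.split(r'[;,?.] ', text)
print(without_punct)
# ['The quick', 'black n brown', 'fox jumps*over the lazy dog.']

# Exceptions inside a class: ']' and '\' are escaped here,
# '^' must not come first, and '-' is safest placed last.
tricky = re.split(r'[\]\\^-]', r'a]b\c^d-e')
print(tricky)
# ['a', 'b', 'c', 'd', 'e']
```

The raw-string prefix matters most for patterns like the second one, where every backslash would otherwise have to be doubled again for Python's own string escaping.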

Python Regex findall But Not Including the conditional string

I have this string:
The quick red fox jumped over the lazy brown dog lazy
And I wrote this regex:
s = 'The quick red fox jumped over the lazy brown dog lazy'
re.findall(r'[\s\w\S]*?(?=lazy)', s)
which gives me the output below:
['The quick red fox jumped over the ', '', 'azy brown dog ', '']
But I am trying to get output like this:
['The quick red fox jumped over the ']
That is, the regex should give me everything up to the first lazy rather than the last one, and I only want to use findall.
Make the pattern non-greedy by adding a ? after the *, and use re.search so that only the first match is returned:
>>> m = re.search(r'[\s\w\S]*?(?=lazy)', s)
#                            ^
>>> m.group()
'The quick red fox jumped over the '
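Since the question asks for findall specifically, one possible sketch is to anchor the pattern with \A, which matches only at the very start of the string, so findall cannot return more than one match. This is an alternative to the re.search answer above, not part of it, and the name first_only is just for illustration:

```python
import re

s = 'The quick red fox jumped over the lazy brown dog lazy'

# \A pins the match to the start of the string, so findall yields
# at most one item: everything before the first 'lazy'.
first_only = re.findall(r'\A.*?(?=lazy)', s, flags=re.DOTALL)
print(first_only)
# ['The quick red fox jumped over the ']

# Without a regex, str.partition gives the same prefix.
prefix = s.partition('lazy')[0]
print(prefix)
```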
