Python Regex findall But Not Including the conditional string - python

I have this string:
The quick red fox jumped over the lazy brown dog lazy
And I wrote this regex:
s = 'The quick red fox jumped over the lazy brown dog lazy'
re.findall(r'[\s\w\S]*?(?=lazy)', s)
which gives me the output below:
['The quick red fox jumped over the ', '', 'azy brown dog ', '']
But I am trying to get output like this:
['The quick red fox jumped over the ']
That is, the regex should give me everything up to the first lazy instead of the last one, and I only want to use findall.

Make the pattern non-greedy by adding a ? after the *, and use re.search, which returns only the first match:
>>> m = re.search(r'[\s\w\S]*?(?=lazy)', s)
#                            ^
>>> m.group()
'The quick red fox jumped over the '
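Since the question asks for findall specifically, here is a minimal sketch of one way to keep findall and still get a single result. It assumes that anchoring the pattern with \A is acceptable, and it swaps the [\s\w\S] class for . with re.DOTALL, which should match the same characters. The split-based line at the end is an extra non-regex alternative, not part of the original answer.

```python
import re

s = 'The quick red fox jumped over the lazy brown dog lazy'

# \A anchors the match at the very start of the string, so findall
# cannot begin a second match further along.
# re.DOTALL lets "." stand in for the original [\s\w\S] class.
first_only = re.findall(r'\A.*?(?=lazy)', s, flags=re.DOTALL)
print(first_only)  # ['The quick red fox jumped over the ']

# Non-regex alternative: split once on the marker and keep the left part.
prefix = s.split('lazy', 1)[0]
print(repr(prefix))  # 'The quick red fox jumped over the '
```

Note that if the string does not contain lazy at all, findall returns an empty list, while the split version returns the whole string.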

Related

How to Change the color of certain words in a cell

for example: The quick brown fox jumps over a lazy dog.

Python regex replace all patterns except when it is next to a repeated pattern

I am using Python and I have a multi-line string that looks like:
The quick brown fox jumps over the lazy dog.
The quick quick brown fox jumps over the quick lazy dog. This a very very very very long line.
This line has other text?
The quick quick brown fox jumps over the quick lazy dog.
I would like to replace all occurrences of quick with slow, with one exception: when quick is followed by another quick, only the first quick is converted and the second, neighboring quick is left unchanged.
So, the output should look like this:
The slow brown fox jumps over the lazy dog.
The slow quick brown fox jumps over the slow lazy dog. This a very very very very long line.
This line has other text?
The slow quick brown fox jumps over the slow lazy dog.
I can do this using multiple passes where I first convert everything to slow and then convert the edge case during my second pass. But I'm hoping that there is a more elegant or obvious one-pass solution.
Here's a variant for regex engines that do not support look-aheads:
quick(( quick)*)
replaced by
slow\1
Here is one way, using re.sub with a negative lookbehind so that quick is replaced only when it is not preceded by another quick:
import re
re.sub(r'(?<!quick\s)quick', 'slow', s)
Using the shared examples:
s1 = 'The quick brown fox jumps over the lazy dog. '
s2 = 'The quick quick brown fox jumps over the quick lazy dog. This a very very very very long line.'
re.sub(r'(?<!quick\s)quick', 'slow', s1)
# 'The slow brown fox jumps over the lazy dog. '
re.sub(r'(?<!quick\s)quick', 'slow', s2)
# 'The slow quick brown fox jumps over the slow lazy dog. This a very very very very long line.'
Regex breakdown:
(?<!quick\s)quick
Negative Lookbehind (?<!quick\s)
quick matches the characters quick literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
quick matches the characters quick literally (case sensitive)
You could also use grouping for this task, in the following way:
import re
txt1 = 'The quick brown fox jumps over the lazy dog.'
txt2 = 'The quick quick brown fox jumps over the quick lazy dog.'
# group 1: the first quick; group 2: any directly following quicks, kept as-is
out1 = re.sub(r'(quick)((\squick)*)', r'slow\2', txt1)
out2 = re.sub(r'(quick)((\squick)*)', r'slow\2', txt2)
print(out1) # The slow brown fox jumps over the lazy dog.
print(out2) # The slow quick brown fox jumps over the slow lazy dog.
The idea is pretty simple: the 1st group matches the first quick and the 2nd group matches the remaining quicks. The match is then replaced with slow followed by the content of the 2nd group.
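To check that the lookbehind and the grouping approaches agree, here is a small sketch that runs both over the full multi-line example from the question. The variable names are only for illustration, and the grouping pattern uses a non-capturing inner group, which is a minor variation on the patterns shown above.

```python
import re

text = (
    "The quick brown fox jumps over the lazy dog.\n"
    "The quick quick brown fox jumps over the quick lazy dog. "
    "This a very very very very long line.\n"
    "This line has other text?\n"
    "The quick quick brown fox jumps over the quick lazy dog."
)

# Approach 1: negative lookbehind, skip any quick that directly
# follows "quick" plus one whitespace character.
by_lookbehind = re.sub(r'(?<!quick\s)quick', 'slow', text)

# Approach 2: match a whole run of quicks, replace the first one
# and put the rest of the run (group 1) back unchanged.
by_grouping = re.sub(r'quick((?:\squick)*)', r'slow\1', text)

print(by_lookbehind == by_grouping)  # True
print(by_lookbehind)
```

Both are single-pass substitutions, so either should do in place of the two-pass workaround mentioned in the question.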

How can I remove non characters from a dataframe? python beautiful soup

I have a dataframe
df
ID  col1
1   The quick brown fox jumped hf_093*&
2   fox run jump *& #7
How can I parse out non-characters in this dataframe?
I tried this, but it doesn't work:
posts = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", posts).split())
You could use the inbuilt functions:
import pandas as pd
df = pd.DataFrame({'ID': [1,2], 'col1': ['The quick brown fox jumped hf_093*&', 'fox run jump *& #7']}).set_index('ID')
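# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
df['col1'] = df['col1'].str.replace(r'[^\w\s]+', '', regex=True)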
print(df)
Which yields
col1
ID
1 The quick brown fox jumped hf_093
2 fox run jump 7
This removes every character that is neither a word character (\w, roughly [a-zA-Z0-9_]) nor whitespace.
If you want finer control, you could use a function
import re
rx = re.compile(r'(?i)\b[a-z]+\b')
def remover(row):
    # keep only the words that start with a run of letters ending at a word boundary
    words = " ".join([word
                      for word in row.split()
                      if rx.match(word)])
    return words
df['col1'] = df['col1'].apply(remover)
print(df)
Which would yield
col1
ID
1 The quick brown fox jumped
2 fox run jump
If what you're looking for is to remove the strings that contain special characters:
Regex:
df.applymap(lambda x: re.sub(r"(?:\w*[^\w ]+\w*)", "", x).strip())
Output:
0
0 The quick brown fox jumped
1 fox run jump
An alternative, non-regex solution for list comprehension enthusiasts:
unwanted = '!@#$%^&*()'
df.applymap(lambda x: ' '.join([i for i in x.split() if not any(c in i for c in unwanted)]))
Output:
0
0 The quick brown fox jumped
1 fox run jump
This removes any string that contains one of the unwanted special characters.
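The answers above follow two different strategies: deleting only the special characters, or dropping every token that contains one. Here is a minimal sketch of both on plain strings, without pandas, so the difference is easy to see. The function names are made up for illustration, and the whitespace collapsing in the first function is an added assumption about the desired output.

```python
import re

rows = ['The quick brown fox jumped hf_093*&', 'fox run jump *& #7']

def strip_special_chars(text):
    # Delete only the offending characters; the rest of each token survives.
    cleaned = re.sub(r'[^\w\s]+', '', text)
    # Collapse the double spaces left behind by removed tokens such as "*&".
    return ' '.join(cleaned.split())

def drop_special_tokens(text):
    # Keep a token only if it consists of ASCII letters and nothing else.
    return ' '.join(w for w in text.split() if re.fullmatch(r'[A-Za-z]+', w))

stripped = [strip_special_chars(row) for row in rows]
dropped = [drop_special_tokens(row) for row in rows]
print(stripped)  # ['The quick brown fox jumped hf_093', 'fox run jump 7']
print(dropped)   # ['The quick brown fox jumped', 'fox run jump']
```

Either function could be passed to df['col1'].apply(...) if the column holds strings.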

Trying to split string with regex

I'm trying to split a string in Python using a regex pattern, but it's not working correctly.
Example text:
"The quick {brown fox} jumped over the {lazy} dog"
Code:
"The quick {brown fox} jumped over the {lazy} dog".split(r'({.*?})')
I'm using a capture group so that the split delimiters are retained in the array.
Desired result:
['The quick', '{brown fox}', 'jumped over the', '{lazy}', 'dog']
Actual result:
['The quick {brown fox} jumped over the {lazy} dog']
As you can see there is clearly not a match as it doesn't split the string. Can anyone let me know where I'm going wrong? Thanks.
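You're calling the string's split method, not re's. str.split treats its argument as a literal separator, so the pattern is never interpreted as a regex: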
>>> re.split(r'({.*?})', "The quick {brown fox} jumped over the {lazy} dog")
['The quick ', '{brown fox}', ' jumped over the ', '{lazy}', ' dog']
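The re.split result above still carries the spaces around each delimiter, while the desired result in the question has them trimmed. One possible sketch, assuming it is fine to swallow whitespace on both sides of the braces, is to put \s* outside the capture group. The braces are escaped here for clarity only.

```python
import re

text = "The quick {brown fox} jumped over the {lazy} dog"

# \s* sits outside the capture group, so surrounding whitespace is
# consumed by the separator but not kept in the result.
parts = re.split(r'\s*(\{.*?\})\s*', text)
print(parts)
# ['The quick', '{brown fox}', 'jumped over the', '{lazy}', 'dog']
```

If the text starts or ends with a {...} block, re.split yields an empty string at that end, which may need filtering out.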

How to wrap long lines in a text using Regular Expressions when you also need to indent the wrapped lines?

How can one change the following text
The quick brown fox jumps over the lazy dog.
to
The quick brown fox +
jumps over the +
lazy dog.
using regex?
UPDATE1
A solution for Ruby is still missing... A simple one I came to so far is
def textwrap text, width, indent="\n"
  return text.split("\n").collect do |line|
    line.scan( /(.{1,#{width}})(\s+|$)/ ).collect{|a|a[0]}.join indent
  end.join("\n")
end
puts textwrap 'The quick brown fox jumps over the lazy dog.', width=19, indent=" + \n "
# >> The quick brown fox +
# >> jumps over the lazy +
# >> dog.
Maybe use textwrap instead of regex:
import textwrap
text='The quick brown fox jumps over the lazy dog.'
print(' + \n'.join(
    textwrap.wrap(text, initial_indent='', subsequent_indent=' '*4, width=20)))
yields:
The quick brown fox +
    jumps over the +
    lazy dog.
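For completeness, the regex idea from the Ruby snippet can be sketched in Python as well. This is a rough port, not a replacement for textwrap: the function name and the default joiner are assumptions, and the width counts only the text, not the indent or the trailing " +".

```python
import re

def regex_wrap(text, width, joiner=' +\n    '):
    # Greedily take up to `width` characters that are followed by
    # whitespace or by the end of the line.
    pattern = r'(.{1,%d})(?:\s+|$)' % width
    wrapped_lines = []
    for line in text.split('\n'):
        chunks = re.findall(pattern, line)
        wrapped_lines.append(joiner.join(chunks))
    return '\n'.join(wrapped_lines)

wrapped = regex_wrap('The quick brown fox jumps over the lazy dog.', width=19)
print(wrapped)
# The quick brown fox +
#     jumps over the lazy +
#     dog.
```

A word longer than width has no valid break point, so this pattern will not handle it cleanly; textwrap.wrap is the safer choice for arbitrary input.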
