Python Regex findall dot + newline - python

I am trying to extract a paragraph containing a certain string using Python. Example:
text = """test textract.

new line
test word.

another line."""
The following code works:
myword = ("word")
re.findall(r'(?<=(\n)).*?'+ myword + r'+.*?(?=(\n))',text)
and will return:
['test word.']
However, if I want to extract ['new line test word.'], none of the following works:
re.findall(r'(?<=(\.\n)).*?'+ myword + r'+.*?(?=(\.\n))',text) -> []
re.findall(r'(?<=(\.\n)).|\n*?'+ myword + r'+.|\n*?(?=(\.\n))',text) -> [('', '.\n'), ('', '.\n')]
re.findall(r'(?<=(\.\n)).*|\n*?'+ myword + r'+.*|\n*?(?=(\.\n))',text) -> [('', '.\n'), ('.\n', ''), ('', '.\n'), ('.\n', '')]
What should be the right way to do this?

You are matching a single line because the newlines are only asserted by the lookarounds, not matched, and . does not match a newline by default.
What you could do is use the anchor ^, repeat all lines that contain at least one non-whitespace character, and require that at least one of those lines contains word.
You could also start the pattern with a newline, but then it would miss a first paragraph that is not preceded by a newline.
^(?:[^\S\n]*\S.*\n)*.*word.*(?:\n[^\S\n]*\S.*)*
import re
text = """test textract.

new line
test word.

another line."""
pattern = r"^(?:[^\S\n]*\S.*\n)*.*word.*(?:\n[^\S\n]*\S.*)*"
print(re.findall(pattern, text, re.MULTILINE))
Output
['new line\ntest word.']
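If the paragraphs are separated by blank lines, another option is to split the text into paragraphs first and then filter them. This is only a sketch: the helper name paragraphs_containing is made up for illustration, and it assumes that a "paragraph" is a run of lines delimited by blank lines (the sample text below is written with those blank lines).

```python
import re

def paragraphs_containing(text, word):
    """Return the blank-line-separated paragraphs of text that contain word."""
    # A newline followed by one or more lines holding only spaces/tabs
    # (or nothing at all) is treated as a paragraph break.
    paragraphs = re.split(r"\n(?:[^\S\n]*\n)+", text)
    # Plain substring test; use a regex with \b if whole words are needed.
    return [p for p in paragraphs if word in p]

# Assumed sample: paragraphs separated by blank lines.
text = "test textract.\n\nnew line\ntest word.\n\nanother line."
print(paragraphs_containing(text, "word"))
```

Splitting first keeps the "what is a paragraph" rule separate from the "does it contain the word" rule, which can be easier to adjust than one combined pattern.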

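What matters here is re.DOTALL: it makes . match newlines as well, so the whole text is analysed as a single line and newlines are treated as regular characters. (re.MULTILINE only changes how ^ and $ behave, and this pattern uses neither, so it is harmless here but not required.)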
import re
text = """\
test textract.

new line
test word.

another line."""
print(re.findall(r'\n+(.*word.*)\n+', text, re.MULTILINE | re.DOTALL))
Output:
['new line\ntest word.\n']
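To see what each of the two flags does on its own, here is a small sketch on a made-up two-line string (the variable names are only for illustration). re.DOTALL changes what . matches, while re.MULTILINE changes where ^ and $ match.

```python
import re

sample = "one\ntwo"

# Without DOTALL, "." refuses to cross the newline.
dot_default = re.findall(r"one.two", sample)
dot_all = re.findall(r"one.two", sample, re.DOTALL)

# Without MULTILINE, "^" and "$" only anchor at the ends of the whole string.
anchors_default = re.findall(r"^\w+$", sample)
anchors_multiline = re.findall(r"^\w+$", sample, re.MULTILINE)

print(dot_default)        # []
print(dot_all)            # ['one\ntwo']
print(anchors_default)    # []
print(anchors_multiline)  # ['one', 'two']
```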

Related

Replace string with same amount of white spaces in search pattern

I would like to keep the same amount of white space in my string but replace 'no' with empty string. Here is my code:
>>> import re
>>> line = ' no command'
>>> line = re.sub('^\s.+no','',line)
>>> line
' command'
Expected output, with only 'no' replaced by an empty string and the amount of white space preserved:
' command'
.+ matches one or more of any character, so if you only want to match one or more whitespace characters, use \s+ instead.
Then capture that in group 1 so it can be reused in the replacement, and match no followed by a word boundary.
Note that \s can also match a newline.
import re
line = ' no command'
line = re.sub(r'^(\s+)no\b', r'\1', line)
print(line)
Output
command
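Since \s can also match a newline, a variant that only accepts horizontal whitespace may be safer. The following is a sketch under assumptions: the function name drop_leading_no and the four leading spaces in the sample are invented for illustration, not taken from the question.

```python
import re

def drop_leading_no(line):
    # [^\S\n]+ matches whitespace except newlines; group 1 keeps it intact.
    # \b makes sure words such as "nothing" are left alone.
    return re.sub(r'^([^\S\n]+)no\b', r'\1', line)

sample = '    no command'  # four leading spaces: an assumed value
result = drop_leading_no(sample)
print(repr(result))
```

The leading spaces survive unchanged, followed by the space that originally separated no from command.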
I think you want a positive lookbehind. It doesn't do exactly what your code does in terms of finding a match at the beginning of the line, but hopefully this sets you on the right track.
>>> line = ' no command'
>>> re.sub(r'(?<=\s)no','',line)
' command'
You can also capture the preceding text.
>>> re.sub(r'^(\s.+)no', r'\1', line)
' command'
What about using the replace method of the string class? Be aware that it removes every occurrence of "no", including those inside other words.
new_line = line.replace("no", "")

Can string replace be written in list comprehension?

I have a text and a list.
text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
I want to replace all elements from list in the string. So, I can do this:
for element in remove_list:
    text = text.replace(element, '')
I can also use regex.
But can this be done in list comprehension or any single liner?
You can use functools.reduce:
from functools import reduce
text = reduce(lambda x, y: x.replace(y, ''), remove_list, text)
# 'Some texts  that I want to  replace'
I would do this with re.sub to remove all the substrings in one pass:
>>> import re
>>> regex = '|'.join(map(re.escape, remove_list))
>>> re.sub(regex, '', text)
'Some texts  that I want to  replace'
Note that the result has two spaces instead of one where each part was removed. If you want each occurrence to leave just one space, you can use a slightly more complicated regex:
>>> re.sub(r'\s*(' + regex + r')', '', text)
'Some texts that I want to replace'
There are other ways to write similar regexes; this one will remove the space preceding a match, but you could alternatively remove the space following a match instead. Which behaviour you want will depend on your use-case.
You can do this with a regex by building a regex from an alternation of the words to remove, taking care to escape the strings so that the [ and ] in them don't get treated as special characters:
import re
text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
regex = re.compile('|'.join(re.escape(r) for r in remove_list))
text = regex.sub('', text)
print(text)
Output:
Some texts  that I want to  replace
Since this may result in double spaces in the result string, you can remove them with replace, e.g.
text = regex.sub('', text).replace('  ', ' ')
Output:
Some texts that I want to replace
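Replacing two spaces with one only halves a run of spaces, so longer runs (for example when several removed items sit next to each other) would still leave extra spaces. A sketch that squeezes any run of spaces in one step could look like this; sorting the alternatives by length is an extra precaution added here, not something the answers above require.

```python
import re

text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]

# Longest first, so a longer entry wins over a shorter one that is its prefix.
alternatives = sorted(remove_list, key=len, reverse=True)
regex = re.compile("|".join(re.escape(s) for s in alternatives))

# Remove the substrings, then squeeze every run of 2+ spaces down to one.
cleaned = re.sub(r" {2,}", " ", regex.sub("", text)).strip()
print(cleaned)
```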

Having a problem with Python Regex: Prints "None" when printing "matches". Regex works in tester

I'm supposed to extract groups of text from a file with a top ten list: name, rank, etc. for each. You can see the file and the regex here https://regex101.com/r/fXK5YV/1. It works in there and you can see the capturing groups.
import re
pattern = '''
(?P<list><li\sclass="regular-search-result">(.|\n)*?(?<=\<span class=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span class=\"review-count rating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.match(yelp_html)
This prints None.
There is definitely text inside of yelp_html.
What am I doing wrong?
I see two issues:
You're not using a raw string (prefix the string with an r), which means that escapes such as \n and \t are interpreted by Python before the regex engine ever sees them, instead of being part of the pattern.
Because the pattern is compiled with re.VERBOSE, unescaped whitespace inside it is ignored, so the literal spaces in parts such as <span class="indexed-biz-name"> never match anything; write them as \s instead. Also note that match() only matches at the very start of the string, so use search() or finditer() to find the list items wherever they occur.
import re
pattern = r'''
(?P<list><li\sclass=\"regular-search-result\">(.|\n)*?(?<=\<span\sclass=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span\sclass=\"review-count\srating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})
(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.finditer(yelp_html)
for item in matches:
    print(item.group('rank', 'name', 'stars', 'numrevs', 'price'))
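The two pitfalls (whitespace being ignored under re.VERBOSE, and match() only trying the start of the string) can be reproduced on a much smaller input. The markup below is hypothetical and heavily simplified; it is not the real page from the question.

```python
import re

# Hypothetical, simplified markup; not the real Yelp page.
html = '<ul><li class="item"><span>1. Cafe</span></li></ul>'

# Under re.VERBOSE the literal space is ignored: this is really <liclass="item">
broken = re.compile(r'''<li class="item">''', re.VERBOSE)
# \s survives verbose mode and matches the space in the input.
fixed = re.compile(r'''<li\sclass="item">''', re.VERBOSE)

print(broken.search(html))         # None
print(fixed.match(html))           # None: match() only tries position 0
print(fixed.search(html).group())  # <li class="item">
```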

Find the Words that Created by Removing Letters from Given String

I'm trying to write a code using regex and my text file. My file contains these words line by line:
each
expressions
flags
in
from
given
line
of
once
lines
no
My goal is to display the words that can be created by removing letters from a given string.
For example, if my string is "flamingoes", my output should be:
flags
in
line
lines
no
These words can be created from my string by removing letters, and they are also in my text file.
I have worked with regex a lot, and this challenge interests me. Is there a regex solution for this?
You should create a regex for each word you are looking for. The expression .*? between each letter is a non-greedy pattern, which will avoid backtracking (at least some of it), and make the search faster.
For example, the regex for the word "given" would be g.*?i.*?v.*?e.*?n
import re
def hidden_words(needles, haystack):
    for needle in needles:
        regex = re.compile('.*?'.join(needle))
        if regex.search(haystack):
            yield needle

needles = ['each', 'expressions', 'flags', 'in', 'from',
           'given', 'line', 'of', 'once', 'lines', 'no']
print(*hidden_words(needles, 'flamingoes'), sep='\n')
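Essentially, each character is optional. Something as simple as this:
import re
word = 'flamingoes'
# "?" marks each letter as optional
pattern = ''.join(re.escape(c) + '?' for c in word)
with open('file') as f:
    for line in f:
        line = line.strip()
        # fullmatch is required: since every letter is optional,
        # re.match would accept any line with an empty match
        if line and re.fullmatch(pattern, line):
            print(line)
should suffice.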
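For comparison, the same "can this word be made by deleting letters" test can be written without a regex, as a plain subsequence check. This is a sketch; the function name is_subsequence is chosen here for illustration, and the word list is the one from the question.

```python
def is_subsequence(needle, haystack):
    """True if needle can be obtained from haystack by deleting letters."""
    remaining = iter(haystack)
    # "ch in remaining" consumes the iterator up to the first hit,
    # so later letters can only be found further to the right.
    return all(ch in remaining for ch in needle)

words = ['each', 'expressions', 'flags', 'in', 'from',
         'given', 'line', 'of', 'once', 'lines', 'no']
found = [w for w in words if is_subsequence(w, 'flamingoes')]
print(*found, sep='\n')
```

Each word is checked in a single left-to-right pass over the string, so there is no backtracking to worry about.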

Delete the repetition of a specific word in a row

For example I have a string:
my_str = 'my example example string contains example some text'
What I want to do is delete all duplicates of a specific word (only if they occur in a row). Result:
my example string contains example some text
I tried the following code:
import re
my_str = re.sub(' example +', ' example ', my_str)
or
my_str = re.sub('\[ example ]+', ' example ', my_str)
But it doesn't work.
I know there are a lot of questions about re, but I still can't implement them to my case correctly.
You need to create a group and quantify it:
import re
my_str = 'my example example string contains example some text'
my_str = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', my_str)
print(my_str) # => my example string contains example some text
# To build the pattern dynamically, if your word is not static
word = "example"
my_str = re.sub(r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word)), r'\1', my_str)
I added word boundaries as - judging by the spaces in the original code - whole word matches are expected.
Pattern details:
\b - word boundary (replaced with (?<!\w) - no word char before the current position is allowed - in the dynamic approach since re.escape might also support "words" like .word. and then \b might stop the regex from matching)
(example) - Group 1 (referred to with \1 from the replacement pattern):
the example word
(?:\s+\1)+ - 1 or more occurrences of
\s+ - 1+ whitespaces
\1 - a backreference to the Group 1 value, that is, an example word
\b - word boundary (replaced with (?!\w) - no word char after the current position is allowed).
Remember that in Python 2.x, you need to use re.U if you need to make \b word boundary Unicode-aware.
Regex: \b(\w+)(?:\s+\1)+\b or \b(example)(?:\s+\1)+\b Substitution: \1
Details:
\b Assert position at a word boundary
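\w Matches any word character ([a-zA-Z0-9_] for ASCII; with Python 3 str patterns it also matches Unicode letters and digits unless re.ASCII is set)
\s Matches any whitespace character
+ Matches between one and unlimited times
\1 Backreference to the text matched by Group 1.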
Python code:
import re
text = 'my example example string contains example some text'
text = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)
Output:
my example string contains example some text
You could also do this in pure Python (without a regex), by creating a list of words and then generating a new string - applying your rules.
>>> words = my_str.split()
>>> ' '.join(w for i, w in enumerate(words) if w != words[i-1] or i == 0)
'my example string contains example some text'
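The one-liner above collapses consecutive repeats of any word. If only one specific word should be collapsed, a sketch with itertools.groupby could look like the following; the function name collapse_repeats is made up here, and like the split/join version it normalizes all whitespace to single spaces.

```python
from itertools import groupby

def collapse_repeats(text, word):
    """Collapse consecutive repeats of word; leave other repeats untouched."""
    out = []
    # groupby bundles runs of identical adjacent words together.
    for key, run in groupby(text.split()):
        if key == word:
            out.append(key)   # keep a single copy of the target word
        else:
            out.extend(run)   # keep every copy of any other word
    return ' '.join(out)

my_str = 'my example example string contains example some text'
print(collapse_repeats(my_str, 'example'))
```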
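Why not use the str.replace method? Note that it only collapses pairs, so three or more repetitions in a row would still leave a duplicate behind:
my_str = 'my example example string contains example some text'
print(my_str.replace("example example", "example"))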
