Period stops multiline regex substitute in Python? - python

I have multiple line string that I'd like to replace, but don't understand why it's not working. For some reason, a period in the string stops the matching for the regular expression.
My string:
s = """
[some_previous_text]
<start>
one_period .
<end>
[some_text_after]
"""
What I'd like to end up with:
s = """
[some_previous_text]
foo
[some_text_after]
"""
What I initially tried, but it doesn't match anything:
>>> import re
>>> s = "<start>\none_period .\n<end>"
>>> print re.sub("<start>[^.]*<end>", "foo", s)
<start>
one_period .
<end>
However, when I took the period out, it worked fine:
>>> import re
>>> s = "<start>\nno_period\n<end>"
>>> print re.sub("<start>[^.]*<end>", "foo", s)
foo
Also, when I put an <end> tag before the period, it matched the first <end> tag:
>>> import re
>>> s = "<start>\n<end>\none_period .\n<end>"
>>> print re.sub("<start>[^.]*<end>", "foo", s)
foo
one_period .
<end>
So what's going on here? Why does the period stop the [^.]* matching?
EDIT:
SOLVED
I mistakenly thought that the carat ^ was for new-line matching. What I needed was a re.DOTALL flag (as indicated by Amber). Here's the expression I'm now using:
>>> import re
>>> s = "<start>\none_period .\n<end>"
>>> print re.sub("<start>.*<end>", "foo", s, flags=re.DOTALL)
foo

Why wouldn't it? [^.] is "the set of all characters that is not a ." and thus doesn't match periods.
Perhaps you instead meant to just put .* (any number of any characters) instead of [^.]*?
For matching across newlines, specify re.DOTALL:
re.sub("<start>.*<end>", "foo", s, flags=re.DOTALL)

Thats because [^.]* is a negated character class that matches any character but a period.
You probably want something like <start>.*?<end> together with the re.S modifier, that makes the dot matches also newline characters.
re.sub("<start>.*?<end>", "foo", s, flags=re.S)

Related

Python regex: remove short lines

I have a string with multiple newline symbols:
text = 'foo\na\nb\n$\n\nxz\nbar'
I want to remove the lines that are shorter than 3 symbols. The desired output is
'foo\n\nbar'
I tried
re.sub(r'(\n([\s\S]{0,2})\n)+', '\nX\n', text, flags= re.S)
but this matches only some subset of the string and the result is
'foo\nX\nb\nX\nxz\nbar'
I need somehow to do greedy search and replace the longest string matching the pattern.
re.S makes . match everything including newline, and you don't want that. Instead use re.M so ^ matches beginning of string and after newline, and use:
>>> import re
>>> text = 'foo\na\nb\n$\n\nxz\nbar'
>>> re.findall('(?m)^.{0,2}\n',text)
['a\n', 'b\n', '$\n', '\n', 'xz\n']
>>> re.sub('(?m)^.{0,2}\n','',text)
'foo\nbar'
That's "from start of a line, match 0-2 non-newline characters, followed by a newline".
I noticed your desired output has a \n\n in it. If that isn't a mistake use .{1,2} if blank lines are to be left in.
You might also want to allow the final line of the string to have an optional terminating newline, for example:
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar') # 3 symbols at end, no newline
'foo\nbar'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar\n') # same, with newline
'foo\nbar\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba\n') # <3 symbols, newline
'foo\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba') # < 3 symbols, no newline
'foo\n'
Perhaps you can use re.findall instead:
text = 'foo\na\nb\n$\n\nxz\nbar'
import re
print (repr("".join(re.findall(r"\n?\w{3,}\n?",text))))
#
'foo\n\nbar'
You can use this regex, which looks for any set of less than 3 non-newline characters following either start-of-string or a newline and followed by a newline or end-of-string, and replace it with an empty string:
(^|\n)[^\n]{0,2}(?=\n|$)
In python:
import re
text = 'foo\na\nb\n$\n\nxz\nbar'
print(re.sub(r'(^|\n)[^\n]{0,2}(?=\n|$)', '', text))
Output
foo
bar
Demo on rextester
There's no need to use regex for this.
raw_str = 'foo\na\nb\n$\n\nxz\nbar'
str_res = '\n'.join([curr for curr in raw_str.splitlines() if len(curr) >= 3])
print(str_res):
foo
bar

python 3 regular expression match string meta-character

I want to write a line of regular expression that can match strings like "(2000)" with years in parentheses. then I can check if any string contains the substring "2000".
for example, I want the regex to match (2000) not 2000, or (20000),or (200).
That is to say: they have to have exactly four digits, the first digit between 1 and 2; they have to include the parentheses.
also 2000 is just an example I use but really I want to the regex to include all the possible years.
You have to escape the open and close paranthesis,
>>> import re
>>> str = """foo(2000)bar(1000)foobar2000"""
>>> regex = r'\(2000\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)']
OR
>>> import re
>>> str = """foo(2000)bar(1000)foobar(2014)barfoo(2020)"""
>>> regex = r'\([0-9]{4}\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)', '(1000)', '(2014)', '(2020)']
It matches all the four digit numbers(year's) present within the paranthesis.
Special characters need to be escaped with a backslash. A parenthesis ( becomes \(. Therefore (2000) becomes \(2000\).
Then you can do something like:
if re.search(r"\(2000\)", subject):
# Successful match
else:
# Match attempt failed
>>> import re
>>> x = re.match(r'\((\d*?)\)', "(2000)")
>>> x.group(1)
'2000'

Python - why doesn't this simple regex work?

This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what re.match() says If zero or more characters at the beginning of string match the regular expression pattern ...
In your case the string doesn't have any digit \d at the beginning. But for the \w it has t at the beginning at your string.
If you want to check for digit in your string using same mechanism, then add .* with your regex:
digit_regex = re.compile('.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.

Easiest way to replace a substring

What would be the easiest way to replace a substring within a string when I don't know the exact substring I am looking for and only know the delimiting strings? For example, if I have the following:
mystr = 'wordone wordtwo "In quotes"."Another word"'
I basically want to delete the first quoted words (including the quotes) and the period (.) following so the resulting string is:
'wordone wordtwo "Another word"'
Basically I want to delete the first quoted words and the quotes and the following period.
You are looking for regular expressions here, using the re module:
import re
quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
result = quoted_plus_fullstop.sub('', mystr)
The pattern matches a literal quote, followed by 1 or more characters that are not quotes, followed by another quote and a full stop.
Demo:
>>> import re
>>> mystr = 'wordone wordtwo "In quotes"."Another word"'
>>> quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
>>> quoted_plus_fullstop.sub('', mystr)
'wordone wordtwo "Another word"'

How do I extract certain parts of strings in Python?

Say I have three strings:
abc534loif
tvd645kgjf
tv96fjbd_gfgf
and three lists:
beginning captures just the first part of the string "the name"
middle captures just the number
end contains only the rest of the characters that are after the number portion
How do I accomplish this in the most efficent way?
Use regular expressions?
>>> import re
>>> strings = 'abc534loif tvd645kgjf tv96fjbd_gfgf'.split()
>>> for s in strings:
... for match in re.finditer(r'\b([a-z]+)(\d+)(.+?)\b', s):
... print match.groups()
...
('abc', '534', 'loif')
('tvd', '645', 'kgjf')
('tv', '96', 'fjbd_gfgf')
This is language agnostic approach that aims at higher efficiency:
find first digit in the string and save its position p0
find last digit in the string and save its position p1
extract substring from 0 to p0-1 into beginning
extract substring from p0 to p1 into middle
extract substring from p1+1 to length-1 into end
I guess you're looking for re.findall:
strs = """
abc534loif
tvd645kgjf
tv96fjbd_gfgf
"""
import re
print re.findall(r'\b(\w+?)(\d+)(\w+)', strs)
>> [('abc', '534', 'loif'), ('tvd', '645', 'kgjf'), ('tv', '96', 'fjbd_gfgf')]
>>> import itertools as it
>>> s="abc534loif"
>>> [''.join(j) for i,j in it.groupby(s, key=str.isdigit)]
['abc', '534', 'loif']
I'd something like this:
>>> import re
>>> l = ['abc534loif', 'tvd645kgjf', 'tv96fjbd_gfgf']
>>> regex = re.compile('([a-z_]+)(\d+)([a-z_]+)')
>>> beginning, middle, end = zip(*[regex.match(s).groups() for s in l])
>>> beginning
('abc', 'tvd', 'tv')
>>> middle
('534', '645', '96')
>>> end
('loif', 'kgjf', 'fjbd_gfgf')
I wouls use regualar expressions like:
(?P<beginning>[^0-9]*)(?P<middle>[^0-9]*)(?P<end>[^0-9]*)
and pull out the three matching sections.
import re
m = re.match(r"(?P<beginning>[^0-9]*)(?P<middle>[^0-9]*)(?P<end>[^0-9]*)", "abc534loif")
m.group('beginning')
m.group('middle')
m.group('end')
import re #You want to match a string against a pattern so you import the regular expressions module 're'
mystring = "abc1234def" #Just a string to test with
match = re.match(r"^(\D+)([0)9]+](\D+)$") #Our regular expression. Everything between brackets is 'captured', meaning that it is accessible as one of the 'groups' in the returned match object. The ^ sign matches at the beginning of a string, while the $ matches the end. the characters in between the square brackets [0-9] are character ranges, so [0-9] matches any digit character, \D is any non-digit character.
if match: # match will be None if the string didn't match the pattern, so we need to check for that, as None.group doesn't exist.
beginning = match.group(1)
middle = match.group(2)
end = match.group(3)

Categories

Resources