Python regex: remove short lines - python

I have a string with multiple newline symbols:
text = 'foo\na\nb\n$\n\nxz\nbar'
I want to remove the lines that are shorter than 3 symbols. The desired output is
'foo\n\nbar'
I tried
re.sub(r'(\n([\s\S]{0,2})\n)+', '\nX\n', text, flags= re.S)
but this matches only some subset of the string and the result is
'foo\nX\nb\nX\nxz\nbar'
I need somehow to do greedy search and replace the longest string matching the pattern.

re.S makes . match everything including newline, and you don't want that. Instead use re.M so ^ matches beginning of string and after newline, and use:
>>> import re
>>> text = 'foo\na\nb\n$\n\nxz\nbar'
>>> re.findall('(?m)^.{0,2}\n',text)
['a\n', 'b\n', '$\n', '\n', 'xz\n']
>>> re.sub('(?m)^.{0,2}\n','',text)
'foo\nbar'
That's "from start of a line, match 0-2 non-newline characters, followed by a newline".
I noticed your desired output has a \n\n in it. If that isn't a mistake and blank lines should be kept, use .{1,2} instead of .{0,2}.
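As a quick check of the .{1,2} variant mentioned above, here is a minimal sketch using the question's sample string. It should reproduce the desired output, blank line included:

```python
import re

text = 'foo\na\nb\n$\n\nxz\nbar'

# {1,2} requires at least one character, so empty lines are never matched
result = re.sub(r'(?m)^.{1,2}\n', '', text)
print(repr(result))  # 'foo\n\nbar'
```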
You might also want to allow the final line of the string to have an optional terminating newline, for example:
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar') # 3 symbols at end, no newline
'foo\nbar'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar\n') # same, with newline
'foo\nbar\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba\n') # <3 symbols, newline
'foo\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba') # < 3 symbols, no newline
'foo\n'

Perhaps you can use re.findall instead:
text = 'foo\na\nb\n$\n\nxz\nbar'
import re
print (repr("".join(re.findall(r"\n?\w{3,}\n?",text))))
# Output:
'foo\n\nbar'
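One caveat with the findall approach: \w only matches word characters, so a long line made entirely of punctuation is dropped too. Below is a rough sketch of a variant, assuming every non-newline character should count toward the length. The sample string with $$$$ is made up for illustration.

```python
import re

sample = 'foo\n$$$$\nbar'  # hypothetical input: the middle line is 4 characters long

# \w{3,} skips the '$$$$' line because '$' is not a word character
word_only = "".join(re.findall(r"\n?\w{3,}\n?", sample))

# [^\n]{3,} counts any character except a newline
any_char = "".join(re.findall(r"\n?[^\n]{3,}\n?", sample))

print(repr(word_only))  # 'foo\n\nbar'
print(repr(any_char))   # 'foo\n$$$$\nbar'
```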

You can use this regex, which looks for any run of fewer than 3 non-newline characters that follows either start-of-string or a newline and is followed by a newline or end-of-string, and replace each match with an empty string:
(^|\n)[^\n]{0,2}(?=\n|$)
In python:
import re
text = 'foo\na\nb\n$\n\nxz\nbar'
print(re.sub(r'(^|\n)[^\n]{0,2}(?=\n|$)', '', text))
Output
foo
bar

There's no need to use regex for this.
raw_str = 'foo\na\nb\n$\n\nxz\nbar'
str_res = '\n'.join([curr for curr in raw_str.splitlines() if len(curr) >= 3])
print(str_res)
Output:
foo
bar
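Note that joining the result of splitlines() drops blank lines and any trailing newline. If those matter, one possible sketch keeps the line endings and makes blank lines optional. The function name drop_short_lines and its parameters are made up for this example, and it assumes lines end in \n or \r\n.

```python
def drop_short_lines(text, min_len=3, keep_blank=False):
    """Remove lines shorter than min_len, preserving original line endings."""
    kept = []
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")  # line content without its terminator
        if len(body) >= min_len or (keep_blank and body == ""):
            kept.append(line)
    return "".join(kept)


text = 'foo\na\nb\n$\n\nxz\nbar'
print(repr(drop_short_lines(text)))                   # 'foo\nbar'
print(repr(drop_short_lines(text, keep_blank=True)))  # 'foo\n\nbar'
```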

Related

Remove Space Before and After a Specific Character Using Regex

Can anyone show me how to remove spaces before and after a hyphen? The code below removes the space before the hyphen, but I need to remove the spaces both before and after it.
#!/usr/bin/python3
import re
test_strings = ["(1973) -trailer.mp4", "(1973)- fanart.jpg", "(1973) - poster.jpg"]
for i in test_strings:
    res = re.sub(' +-', '-', i)
    print(res)
You could probably do it in a single regex, but since I'm sometimes lazy, I'd just chain two together, like:
#!/usr/bin/python3
import re
test_strings = ["(1973) -trailer.mp4", "(1973)- fanart.jpg", "(1973) - poster.jpg"]
for i in test_strings:
    res = re.sub('- +', '-', re.sub(' +-', '-', i))
    print(res)
Edit: MrGeek has a better answer in a comment.
You can use the pattern \s*-\s*, where \s matches any whitespace character. If you strictly want to match only spaces, use a literal space in the pattern instead, i.e. ' *- *'.
>>> pattern = re.compile(r'\s*-\s*')
>>> [pattern.sub('-', item) for item in test_strings]
#output:
['(1973)-trailer.mp4', '(1973)-fanart.jpg', '(1973)-poster.jpg']
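The difference between the two patterns only shows up when the whitespace is not a plain space. Here is a small sketch, using a made-up file name with tabs around the hyphen:

```python
import re

sample = "(1973)\t-\tposter.jpg"  # hypothetical name with tabs around the hyphen

any_whitespace = re.sub(r'\s*-\s*', '-', sample)  # \s also matches tabs
spaces_only = re.sub(r' *- *', '-', sample)       # only literal spaces are removed

print(repr(any_whitespace))  # '(1973)-poster.jpg'
print(repr(spaces_only))     # '(1973)\t-\tposter.jpg'
```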

Python Regex: Remove optional characters

I have a regex pattern with optional characters; however, in the output I want to remove those optional characters. Example:
string = 'a2017a12a'
pattern = re.compile("((20[0-9]{2})(.?)(0[1-9]|1[0-2]))")
result = pattern.search(string)
print(result)
I can have a match like this but what I want as an output is:
desired output = '201712'
Thank you.
You've already captured the intended data in groups, so you can use re.sub to replace the whole match with just the contents of group 1 and group 2.
Try this modified version of your code:
import re
string = 'a2017a12a'
pattern = re.compile(".*(20[0-9]{2}).?(0[1-9]|1[0-2]).*")
result = re.sub(pattern, r'\1\2', string)
print(result)
Notice how I've added .* around the pattern so that any extra characters around your data are matched and removed. I also removed the extra parentheses that were not needed. This also works with strings that have other digits surrounding that text, such as hello123 a2017a12a some other 99 numbers.
Output:
201712
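An alternative to re.sub is to read the groups straight off the match object, since the optional character sits outside both groups. This is a minimal sketch along those lines, not the only way to do it:

```python
import re

string = 'a2017a12a'
pattern = re.compile(r"(20[0-9]{2}).?(0[1-9]|1[0-2])")

match = pattern.search(string)
if match:
    # group(1) is the year, group(2) the month; the optional character is not captured
    result = match.group(1) + match.group(2)
    print(result)  # 201712
```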
You can just use re.sub with the pattern \D (any character that is not a digit):
>>> import re
>>> string = 'a2017a12a'
>>> re.sub(r'\D', '', string)
'201712'
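Keep in mind that \D strips every non-digit, so any other digits in the string survive as well. A short sketch of the difference, using the longer sample string from the answer above:

```python
import re

noisy = 'hello123 a2017a12a some other 99 numbers'

# Removing non-digits keeps all digits, wanted or not
digits_only = re.sub(r'\D', '', noisy)
print(digits_only)  # 12320171299

# Matching the year/month shape picks out only the intended part
m = re.search(r'(20[0-9]{2}).?(0[1-9]|1[0-2])', noisy)
year_month = m.group(1) + m.group(2) if m else None
print(year_month)  # 201712
```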
Try this one:
import re
string = 'a2017a12a'
pattern = re.findall(r"(\d+)", string)  # capture each run of digits
print("".join(pattern))  # join the digit runs together
Output:
201712
If you want to remove all letters from the string, you can do this:
import re
string = 'a2017a12a'
re.sub('[A-Za-z]+','',string)
Output:
'201712'
You can use functions from the re module to get the required output, for example:
import re
#method 1
string = 'a2017a12a'
print (re.sub(r'\D', '', string))
#method 2
pattern = re.findall(r"(\d+)", string)
print("".join(pattern))
See the re module documentation for more details:
https://docs.python.org/3/library/re.html

Remove trailing special characters from string

I'm trying to use a regex to clean some data before I insert the items into the database. I haven't been able to solve the issue of removing trailing special characters at the end of my strings.
How do I write this regex to only remove trailing special characters?
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
    clean_this = re.sub(r'([_+!##$?^])', '', item)
    print(clean_this)
outputs this:
string01 # correct
string02 # incorrect because it removes the _ inside the string
string03 # correct
string041 # incorrect because it removes the _ inside the string
string05a # incorrect because it removes the _ inside the string and not just the trailing _
You could also use the special purpose rstrip method of strings
[s.rstrip('_+!##$?^') for s in strings]
# ['string01', 'str_ing02', 'string03', 'string04_1', 'string05_a']
Repeat the character class one or more times (otherwise only a single special character would be replaced), then assert the end of the string with $. Note that you don't need the capturing group around the character class:
[_+!##$?^]+$
For example:
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
    clean_this = re.sub(r'[_+!##$?^]+$', '', item)
    print(clean_this)
If you also want to remove whitespace characters at the end you could add \s to the character class:
[_+!##$?^\s]+$
You need the end-of-string anchor $ after a repeated character class:
clean_this = (re.sub(r'[_+!##$?^]+$', '', item))
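If the set of trailing characters comes from a variable rather than being typed into the pattern, the character class can be built with re.escape. Below is a minimal sketch, assuming a hypothetical TRAILING constant that holds the characters to strip. It also checks that the regex result agrees with str.rstrip.

```python
import re

TRAILING = "_+!#$?^"  # hypothetical set of characters to strip from the end

# re.escape keeps characters such as ^ from being read as class syntax
trailing_re = re.compile("[" + re.escape(TRAILING) + "]+$")

strings = ['string01_', 'str_ing02_^', 'string03_#_', 'string04_1', 'string05_a_']
with_regex = [trailing_re.sub('', s) for s in strings]
with_rstrip = [s.rstrip(TRAILING) for s in strings]

print(with_regex)
print(with_regex == with_rstrip)  # True
```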

python regex - replace newline (\n) to something else

I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?
EDIT: To replace multiple consecutive newline characters (\n) with ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) is a lookahead that asserts the newline characters are followed by a capital letter, without consuming it.
Well, let's take a look at your regex ([\n]+)([A-Z])+. The first part, ([\n]+), is fine: it matches multiple occurrences of a newline into one group (note that this won't match the carriage return \r). The second part, ([A-Z])+, leads to your error. It matches a single uppercase letter into a capturing group, multiple times if there are multiple uppercase letters. Each repetition resets the group to the last matched uppercase letter, which is then used for the replacement.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
You could simply place the + inside the capturing group, so that all the uppercase letters are matched into it. You could also just leave it out, as it makes no difference how many uppercase letters follow:
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns
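The behaviour of the repeated capturing group described above can be checked directly. Here is a small sketch comparing the original pattern with two of the fixes on the same sample string:

```python
import re

s = "Inc\n\nABRAXAS"

# ([A-Z])+ repeats the group, so \2 ends up holding only the last letter
repeated_group = re.sub(r'(\n+)([A-Z])+', r'____\2', s)

# ([A-Z]+) puts the whole run of capitals into the group
quantifier_inside = re.sub(r'(\n+)([A-Z]+)', r'____\2', s)

# a lookahead checks for the capital letter without consuming it
lookahead = re.sub(r'\n+(?=[A-Z])', '____', s)

print(repeated_group)     # Inc____S
print(quantifier_inside)  # Inc____ABRAXAS
print(lookahead)          # Inc____ABRAXAS
```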
Try:
import re
p = re.compile(r'\r?\n')  # an optional carriage return followed by a newline
test_str = "(2 months)\n\nML"
subst = "_"
result = re.sub(p, subst, test_str)
It will reduce the string to
(2 months)__ML

Extract unicode substrings with the re module

I have a string like this:
s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
I want this text:
result = 'the unicode text I want with an é'
I've tried to use this code:
expr = r'(?<=BEGIN)[\sa-zA-Z]+(?=END)'
result = re.search(expr, s)
result = re.sub(r'(^\s+)|(\s+$)', '', result) # just to strip out leading/trailing white space
But as long as the é is in the string s, re.search always returns None.
Note, I've tried using different combinations of .* instead of [\sa-zA-Z]+ without success.
The character ranges a-z and A-Z only capture ASCII characters. You can use . to capture Unicode characters:
>>> import re
>>> s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
>>> print(re.search(r'BEGIN(.+?)END', s).group(1))
the unicode text I want with an é
>>>
Note too that I simplified your pattern a bit. Here is what it does:
BEGIN # Matches BEGIN
(.+?) # Captures one or more characters non-greedily
END # Matches END
Also, you do not need Regex to remove whitespace from the ends of a string. Just use str.strip:
>>> ' a '.strip()
'a'
>>>
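If you would rather keep the lookaround pattern from the question, another option is to swap a-zA-Z for \w. This sketch assumes Python 3 and a str (not bytes) pattern without the re.ASCII flag, where \w matches Unicode word characters:

```python
import re

s = 'something extra BEGIN the unicode text I want with an é END some more extra stuff'

# For str patterns in Python 3, \w matches Unicode word characters such as é
match = re.search(r'(?<=BEGIN)[\s\w]+(?=END)', s)
result = match.group().strip() if match else None
print(result)  # the unicode text I want with an é
```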
