Split a string in Python by a specific line in text - python

I want to split a body of text if there is a line which contains only "----". I am using the re.split(..) method but it's not behaving as expected. What am I missing?
import re
s = """width:5
----
This is a test sentence to test the width thing"""
print(re.split('^----$', s))
This simply prints:
['width:5\n----\nThis is a test sentence to test the width thing']

You are missing the MULTILINE flag:
print(re.split(r'^----$', s, flags=re.MULTILINE))
Without it, ^ and $ apply to the whole string s rather than to each line in it. From the documentation:
re.MULTILINE
When specified, the pattern character '^' matches at the beginning of
the string and at the beginning of each line (immediately following
each newline); and the pattern character '$' matches at the end of the
string and at the end of each line (immediately preceding each
newline).
Demo:
>>> import re
>>>
>>> s = """width:5
... ----
... This is a test sentence to test the width thing"""
>>>
>>> print(re.split(r'^----$', s, flags=re.MULTILINE))
['width:5\n', '\nThis is a test sentence to test the width thing']
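If the newlines left around the separator are unwanted, the pattern can consume them as well. The following is only a sketch: the function name split_on_separator_line and its sep parameter are made up for illustration, and it assumes lines end in \n.

```python
import re

def split_on_separator_line(text, sep="----"):
    """Split text on lines that consist solely of sep (hypothetical helper)."""
    # \n? on both sides swallows the line breaks around the separator line;
    # re.escape keeps the separator literal even if it contains regex characters.
    pattern = r'\n?^' + re.escape(sep) + r'$\n?'
    return re.split(pattern, text, flags=re.MULTILINE)

s = """width:5
----
This is a test sentence to test the width thing"""

print(split_on_separator_line(s))
# ['width:5', 'This is a test sentence to test the width thing']
```

A line with five dashes, or with other text next to the dashes, is not treated as a separator because ^ and $ still have to match at the line's edges.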

Another way to split, without using a regex:
s.split("\n----\n")

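Less code also gives the expected result. The pattern has no ^ or $, so re.MULTILINE is not needed here; passed as the third positional argument it would be taken as maxsplit rather than flags:
IN:
re.split('[\n-]+', s)
OUT:
['width:5', 'This is a test sentence to test the width thing']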
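One caveat with a character class such as [\n-]+ is that it splits on every run of newlines or hyphens, not only on a separator line. A small sketch comparing it with the anchored pattern; the sample text is made up for illustration:

```python
import re

# Hypothetical sample text with a hyphen outside the separator line.
text = "line-width:5\n----\nbody"

# Splits wherever a newline or hyphen occurs.
loose = re.split(r'[\n-]+', text)

# Splits only on a line that is exactly '----', then trims the leftover newlines.
strict = [part.strip('\n')
          for part in re.split(r'^----$', text, flags=re.MULTILINE)]

print(loose)   # ['line', 'width:5', 'body']
print(strict)  # ['line-width:5', 'body']
```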

You can also drop ^ and $, which without the MULTILINE flag anchor only to the start and end of the whole string, and use positive lookarounds instead so that the \n characters are kept:
>>> print(re.split('(?<=\n)----(?=\n)', s))
['width:5\n', '\nThis is a test sentence to test the width thing']

Did you try:
result = re.split("^----$", subject_text, 0, re.MULTILINE)

Related

Python regex: remove short lines

I have a string with multiple newline symbols:
text = 'foo\na\nb\n$\n\nxz\nbar'
I want to remove the lines that are shorter than 3 symbols. The desired output is
'foo\n\nbar'
I tried
re.sub(r'(\n([\s\S]{0,2})\n)+', '\nX\n', text, flags= re.S)
but this matches only some subset of the string and the result is
'foo\nX\nb\nX\nxz\nbar'
I need somehow to do greedy search and replace the longest string matching the pattern.
re.S makes . match everything including newline, and you don't want that. Instead use re.M so ^ matches beginning of string and after newline, and use:
>>> import re
>>> text = 'foo\na\nb\n$\n\nxz\nbar'
>>> re.findall('(?m)^.{0,2}\n',text)
['a\n', 'b\n', '$\n', '\n', 'xz\n']
>>> re.sub('(?m)^.{0,2}\n','',text)
'foo\nbar'
That's "from start of a line, match 0-2 non-newline characters, followed by a newline".
I noticed your desired output has a \n\n in it. If that isn't a mistake and blank lines are to be left in, use .{1,2} instead.
You might also want to allow the final line of the string to have an optional terminating newline, for example:
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar') # 3 symbols at end, no newline
'foo\nbar'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar\n') # same, with newline
'foo\nbar\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba\n') # <3 symbols, newline
'foo\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba') # < 3 symbols, no newline
'foo\n'
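The substitutions above can be wrapped in a small helper. This is only a sketch: the function name drop_short_lines and its min_len and keep_blank parameters are made up for illustration, and it assumes lines are separated by \n.

```python
import re

def drop_short_lines(text, min_len=3, keep_blank=False):
    """Remove lines shorter than min_len characters (hypothetical helper)."""
    shortest = 1 if keep_blank else 0    # .{1,n} leaves empty lines alone
    longest = min_len - 1
    if longest < shortest:
        return text                      # nothing can be shorter than min_len
    pattern = r'^.{%d,%d}$\n?' % (shortest, longest)
    return re.sub(pattern, '', text, flags=re.MULTILINE)

text = 'foo\na\nb\n$\n\nxz\nbar'
print(repr(drop_short_lines(text)))                   # 'foo\nbar'
print(repr(drop_short_lines(text, keep_blank=True)))  # 'foo\n\nbar'
```

With keep_blank=True the result matches the 'foo\n\nbar' shown in the question, because the empty line no longer counts as a short line.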
Perhaps you can use re.findall instead:
text = 'foo\na\nb\n$\n\nxz\nbar'
import re
print (repr("".join(re.findall(r"\n?\w{3,}\n?",text))))
#
'foo\n\nbar'
You can use this regex, which looks for any run of fewer than 3 non-newline characters that follows either start-of-string or a newline and is followed by a newline or end-of-string, and replace it with an empty string:
(^|\n)[^\n]{0,2}(?=\n|$)
In python:
import re
text = 'foo\na\nb\n$\n\nxz\nbar'
print(re.sub(r'(^|\n)[^\n]{0,2}(?=\n|$)', '', text))
Output
foo
bar
There's no need to use regex for this.
raw_str = 'foo\na\nb\n$\n\nxz\nbar'
str_res = '\n'.join([curr for curr in raw_str.splitlines() if len(curr) >= 3])
print(str_res)
Output:
foo
bar
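If a trailing newline in the input should survive, a variant of the non-regex approach can keep each line's own terminator. A sketch with a made-up function name, relying on str.splitlines(keepends=True):

```python
def drop_short_lines_plain(text, min_len=3):
    """Non-regex variant that preserves line terminators (hypothetical helper)."""
    # keepends=True leaves the terminator on each line, so joining with ''
    # restores the original line breaks of the lines that are kept.
    lines = text.splitlines(keepends=True)
    return ''.join(line for line in lines
                   if len(line.rstrip('\r\n')) >= min_len)

print(repr(drop_short_lines_plain('foo\na\nb\n$\n\nxz\nbar')))  # 'foo\nbar'
print(repr(drop_short_lines_plain('foo\na\nbar\n')))            # 'foo\nbar\n'
```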

Python Regex: Remove optional characters

I have a regex pattern with optional characters however at the output I want to remove those optional characters. Example:
string = 'a2017a12a'
pattern = re.compile("((20[0-9]{2})(.?)(0[1-9]|1[0-2]))")
result = pattern.search(string)
print(result)
I can have a match like this but what I want as an output is:
desired output = '201712'
Thank you.
You've already captured the intended data in groups, so you can use re.sub to replace the whole match with just the contents of group 1 and group 2.
Try this modified version of your code:
import re
string = 'a2017a12a'
pattern = re.compile(".*(20[0-9]{2}).?(0[1-9]|1[0-2]).*")
result = re.sub(pattern, r'\1\2', string)
print(result)
Notice how I've added .* around the pattern, so that any extra characters around your data are matched and removed. I also removed the extra parentheses that were not needed. This also works with strings that have other digits surrounding the text, such as hello123 a2017a12a some other 99 numbers.
Output,
201712
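Since the question's code already calls pattern.search, another option is to read the captured groups straight off the match object instead of substituting. A minimal sketch using the same year and month sub-patterns; the variable names are illustrative:

```python
import re

string = 'a2017a12a'
# Group 1 is the year, group 2 the month; the optional separator is not captured.
pattern = re.compile(r"(20[0-9]{2}).?(0[1-9]|1[0-2])")

match = pattern.search(string)
# search returns None when nothing matches, so guard before reading groups.
year_month = match.group(1) + match.group(2) if match else None
print(year_month)  # 201712
```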
You can just use re.sub with the pattern \D (= any character that is not a digit):
>>> import re
>>> string = 'a2017a12a'
>>> re.sub(r'\D', '', string)
'201712'
Try this one:
import re
string = 'a2017a12a'
pattern = re.findall(r"(\d+)", string) # this regex captures each run of digits
print("".join(pattern)) # combine all digits
Output:
201712
If you want to remove all letters from the string, you can do this:
import re
string = 'a2017a12a'
re.sub('[A-Za-z]+','',string)
Output:
'201712'
You can use the re module to get the required output, for example:
import re
#method 1
string = 'a2017a12a'
print (re.sub(r'\D', '', string))
#method 2
pattern = re.findall(r"(\d+)", string)
print("".join(pattern))
See the documentation below for more details:
https://docs.python.org/3/library/re.html

How can I "divide" words with regular expressions?

I have a sentence in which every token has a / in it. I want to just print what I have before the slash.
What I have now is basic:
text = less/RBR.....
return re.findall(r'\b(\S+)\b', text)
This obviously just prints the text, how do I cut off the words before the /?
Assuming you want all the characters before the slash from every word that contains a slash: for the input string match/this but nothing here but another/one, for example, you would want the results match and another.
With regex:
import re
my_string = "match/this but nothing here but another/one"
result = re.findall(r"\b(\w*?)/\w*?\b", my_string)
print(result)
Without regex:
result = [word.split("/")[0] for word in my_string.split() if "/" in word]
print(result)
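For part-of-speech style input such as less/RBR, the non-regex idea can be sketched as follows. The sample sentence is invented for illustration (the question's text is truncated), and str.partition is used so that only the first slash in a token matters:

```python
# Hypothetical tagged sentence; every token has the form word/TAG.
text = "less/RBR than/IN expected/VBN"

# partition("/") returns (before, "/", after); keep only the part before the slash.
words = [token.partition("/")[0] for token in text.split() if "/" in token]
print(words)  # ['less', 'than', 'expected']
```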
Simple and straight-forward:
rx = r'^[^/]+'
# anchor it to the beginning
# the class says: match everything not a forward slash as many times as possible
In Python this would be:
import re
text = "less/RBR....."
print(re.match(r'[^/]+', text))
This prints the match object; to get the matched text, call .group(0):
print(re.match(r'[^/]+', text).group(0))
# less
This should also work
\b([^\s/]+)(?=/)\b
Python Code
import re
p = re.compile(r'\b([^\s/]+)(?=/)\b')
test_str = "less/RBR/...."
print(re.findall(p, test_str))

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that a "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or the string boundaries, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
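To see how that pattern behaves, here is a short sketch that wraps it in a function. The name collapse_repeated_words and the test strings other than "bye! bye! bye!" are made up for illustration:

```python
import re

# Group 2 is the word; group 1 is the word plus its first repetition.
# The trailing (?:\s+\2)+ requires at least one further repetition, and the
# lookarounds make sure whole whitespace-delimited words are compared.
REPEATED_WORD = re.compile(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)')

def collapse_repeated_words(s):
    """Replace three or more repetitions of a word by two (hypothetical helper)."""
    return REPEATED_WORD.sub(r'\1', s)

print(collapse_repeated_words("bye! bye! bye!"))     # bye! bye!
print(collapse_repeated_words("bye! bye!"))          # bye! bye!
print(collapse_repeated_words("no no no nonsense"))  # no no nonsense
```

The last call shows why the (?!\S) lookahead matters: the "no" at the start of "nonsense" is not counted as another repetition.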
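You could also try the regex below:
(?<= |^)(\S+)(?: \1){2,}(?= |$)
Sample code (this uses the third-party regex module; the standard re module rejects this lookbehind because its alternatives differ in width):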
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max: out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
Try the following:
import re
s = "bye! bye! bye!"  # your string
s = re.sub( r'(\S+) (?:\1 ?){2,}', r'\1 \1', s )
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)

Find all words in a string that start with the $ sign in Python

How can I extract all words in a string that start with the $ sign? For example in the string
This $string is an $example
I want to extract the words $string and $example.
I tried with this regex \b[$]\S* but it works fine only if I use a normal character rather than dollar.
>>> [word for word in mystring.split() if word.startswith('$')]
['$string', '$example']
The problem with your expression is that \b doesn't match between a space and a $. If you remove it, everything works:
z = 'This $string is an $example'
import re
print(re.findall(r'[$]\S*', z)) # ['$string', '$example']
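The reason \b gets in the way is that it only matches between a word character (\w) and a non-word character, and both a space and $ are non-word characters. A short sketch; the second sample string is made up for illustration:

```python
import re

# No boundary exists between ' ' and '$', so nothing matches here.
no_match = re.findall(r'\b[$]\S*', 'This $string is an $example')
print(no_match)  # []

# A boundary does exist between a letter and '$'.
after_letter = re.findall(r'\b[$]\S*', 'cost$5')
print(after_letter)  # ['$5']
```

This also explains the observation in the question that the pattern works with a normal character in place of the dollar sign: a letter after a space does sit on a word boundary.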
To avoid matching words$like$this, add a lookbehind assertion:
z = 'This $string is an $example and this$not'
import re
print(re.findall(r'(?<=\W)[$]\S*', z)) # ['$string', '$example']
The \b escape matches at word boundaries, but the $ sign is not a word character, so there is no boundary between a space and a $. Match on the start of the string or on whitespace instead:
re.compile(r'(?:^|\s)(\$\w+)')
I've used a backslash escape for the dollar here instead of a character class, and the \w+ word character class with a minimum of 1 character to better reflect your intent.
Demo:
>>> import re
>>> dollaredwords = re.compile(r'(?:^|\s)(\$\w+)')
>>> dollaredwords.search('Here is an $example for you!')
<_sre.SRE_Match object at 0x100882a80>
Several approaches, depending on what you want to define as a 'word' and whether all words are delimited by spaces:
>>> s='This $string is an $example $second$example'
>>> re.findall(r'(?<=\s)\$\w+',s)
['$string', '$example', '$second']
>>> re.findall(r'(?<=\s)\$\S+',s)
['$string', '$example', '$second$example']
>>> re.findall(r'\$\w+',s)
['$string', '$example', '$second', '$example']
If you might have a 'word' at the beginning of a line:
>>> re.findall(r'(?:^|\s)(\$\w+)','$string is an $example $second$example')
['$string', '$example', '$second']
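The start-of-string and whitespace cases can also be covered by a single negative lookbehind, (?<!\S), which succeeds when the preceding character is not a non-whitespace character, including when there is no preceding character at all. A sketch with a made-up function name, assuming a 'word' is $ followed by \w characters:

```python
import re

def find_dollar_words(text):
    """Return $-prefixed words that start the string or follow whitespace (hypothetical helper)."""
    # (?<!\S) has zero width, so no capture group is needed to drop the space.
    return re.findall(r'(?<!\S)\$\w+', text)

print(find_dollar_words('This $string is an $example'))
# ['$string', '$example']
print(find_dollar_words('$string is an $example $second$example'))
# ['$string', '$example', '$second']
```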
Using regex:
import re
z = 'This $string is an $example to$ get o$n$l$y words $sta$rts with $'
print(re.findall(r'\s\$\S*', z))  # the $ must be escaped; unescaped it is the end-of-string anchor
Output: [' $string', ' $example', ' $sta$rts', ' $']
