re split to break a string into components but keeping separators

I want to break a string into components
s = 'Hello [foo] world!'
re.split(r'\[(.*?)\]', s)
which gives me
['Hello ', 'foo', ' world!']
But I want to achieve
['Hello ', '[foo]', ' world!']
Please help!

Wrap the whole bracketed token in a capturing group, so that re.split keeps the separator in the result:
import re
s = 'Hello [foo] world!'
print(re.split(r'(\[[^][]*])', s))
Results: ['Hello ', '[foo]', ' world!']
Explanation
(           group and capture to \1:
  \[        '['
  [^][]*    any character except ']' and '[' (0 or more times, matching as many as possible)
  ]         ']'
)           end of \1
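As a further illustration (a sketch, not part of the original answer), the same capturing-group split also copes with several bracketed tokens. The sample string and the helper name split_keep_brackets are made up for this example. re.split emits empty strings when a token sits at the very start or end of the input, so the helper filters them out.

```python
import re

def split_keep_brackets(text):
    """Split text on [...] tokens while keeping the tokens themselves."""
    # The capturing group makes re.split return the separators too
    parts = re.split(r'(\[[^][]*])', text)
    # Drop the empty strings produced by tokens at the start or end
    return [part for part in parts if part]

print(split_keep_brackets('[a] Hello [foo] world [bar]'))
# ['[a]', ' Hello ', '[foo]', ' world ', '[bar]']
```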

Related

RegEx for matching capital letters and numbers

Hi, I have a large corpus that I parse to extract all occurrences of certain patterns:
codes such as AP70, ML71, GR55, etc.
and sequences of words that each start with a capital letter, such as: Hello Little Monkey, How Are You, etc.
For the first case I wrote this regexp, but it does not return all matches:
>>> p = re.compile("[A-Z]+[0-9]+")
>>> res = p.search("aze azeaz GR55 AP1 PM89")
>>> res
<re.Match object; span=(10, 14), match='GR55'>
and for the second one:
>>> s = re.compile("[A-Z]+[a-z]+\s[A-Z]+[a-z]+\s[A-Z]+[a-z]+")
>>> resu = s.search("this is a test string, Hello Little Monkey, How Are You ?")
>>> resu
<re.Match object; span=(23, 42), match='Hello Little Monkey'>
>>> resu.group()
'Hello Little Monkey'
It seems to work, but I want to get all matches when parsing a whole 'big' line.

Try these two regexes
(for safety, they are enclosed by whitespace/comma boundaries):
>>> import re
>>> teststr = "aze azeaz GR55 AP1 PM89"
>>> res = re.findall(r"(?<![^\s,])[A-Z]+[0-9]+(?![^\s,])", teststr)
>>> print(res)
['GR55', 'AP1', 'PM89']
>>>
Readable regex
(?<! [^\s,] )
[A-Z]+ [0-9]+
(?! [^\s,] )
and
>>> import re
>>> teststr = "this is a test string, ,Hello Little Monkey, How Are You ?"
>>> res = re.findall(r"(?<![^\s,])[A-Z]+[a-z]+(?:\s[A-Z]+[a-z]+){1,}(?![^\s,])", teststr)
>>> print(res)
['Hello Little Monkey', 'How Are You']
>>>
Readable regex
(?<! [^\s,] )
[A-Z]+ [a-z]+
(?: \s [A-Z]+ [a-z]+ ){1,}
(?! [^\s,] )
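The two patterns above could also be compiled once and reused for every line of a corpus. This is only a sketch: the names CODE_RE, PHRASE_RE, find_codes and find_phrases are hypothetical, and {1,} is written as the equivalent +.

```python
import re

# Same expressions as in the answer, compiled once for reuse
CODE_RE = re.compile(r"(?<![^\s,])[A-Z]+[0-9]+(?![^\s,])")
PHRASE_RE = re.compile(r"(?<![^\s,])[A-Z]+[a-z]+(?:\s[A-Z]+[a-z]+)+(?![^\s,])")

def find_codes(line):
    """Return all tokens such as GR55 or AP1 found in line."""
    return CODE_RE.findall(line)

def find_phrases(line):
    """Return all runs of two or more capitalized words found in line."""
    return PHRASE_RE.findall(line)

print(find_codes("aze azeaz GR55 AP1 PM89"))
print(find_phrases("this is a test string, ,Hello Little Monkey, How Are You ?"))
```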

This expression might help you to do so, or design one. It seems you wish that your expression would contain at least one [A-Z] and at least one [0-9]:
(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)
Example Code:
This code shows how the expression would work in Python:
# -*- coding: UTF-8 -*-
import re
string = "aze azeaz GR55 AP1 PM89"
expression = r'(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches! Something is not right! Call 911 👮')
Example Output
YAAAY! "GR55" is a match 💚💚💚
Performance
This JavaScript snippet shows the performance of your expression using a simple 1-million times for loop.
repeat = 1000000;
start = Date.now();
for (var i = repeat; i >= 0; i--) {
    var string = 'aze azeaz GR55 AP1 PM89';
    var regex = /(.*?)(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)/g;
    var match = string.replace(regex, "$2 ");
}
end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

Replace leading whitespace with other char - Python

I want to replace each leading whitespace character with one &nbsp;.
So:
spam --> spam
 eggs --> &nbsp;eggs
  spam eggs --> &nbsp;&nbsp;spam eggs
I've seen a couple of solutions using regex, but all are in other languages.
I've tried the following in Python but with no luck.
import re
raw_line = '  spam eggs'
line = re.subn(r'\s+', '&nbsp;', raw_line, len(raw_line))
print(line) # outputs ('&nbsp;spam&nbsp;eggs', 2)
line = re.sub(r'\s+', '&nbsp;', raw_line)
print(line) # outputs &nbsp;spam&nbsp;eggs
line = re.sub(r'^\s', '&nbsp;', raw_line)
print(line) # outputs &nbsp; spam eggs
line = re.sub(r'^\s+', '&nbsp;', raw_line)
print(line) # outputs &nbsp;spam eggs
The last line seems to be closest, but still no cigar.
What is the proper way to replace each leading whitespace with &nbsp; in Python?
If there is a clean way to do this without regex, I will gladly accept, but I couldn't figure it out by myself.

You don't even need expensive regex here; just strip the leading whitespace and prepend one &nbsp; for each stripped character:
def replace_leading(source, char="&nbsp;"):
    stripped = source.lstrip()
    return char * (len(source) - len(stripped)) + stripped
print(replace_leading("spam")) # spam
print(replace_leading(" eggs")) # &nbsp;eggs
print(replace_leading("  spam eggs")) # &nbsp;&nbsp;spam eggs

You can use re.sub with a callback function and evaluate the length of the match:
>>> import re
>>> raw_line = '  spam eggs'
>>> re.sub(r"^\s+", lambda m: "&nbsp;" * len(m.group()), raw_line)
'&nbsp;&nbsp;spam eggs'
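If the input holds several lines, the same callback idea can be applied per line with re.MULTILINE. This is a sketch under assumptions: the helper name replace_leading_per_line is hypothetical, and it deliberately matches only spaces and tabs so that a leading-whitespace run cannot swallow a blank line's newline.

```python
import re

def replace_leading_per_line(text, char="&nbsp;"):
    """Replace each leading space or tab on every line with char."""
    # With re.MULTILINE, ^ matches at the start of every line
    return re.sub(r"^[ \t]+", lambda m: char * len(m.group()), text, flags=re.MULTILINE)

print(replace_leading_per_line("spam\n eggs\n  spam eggs"))
```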

With regex module (answered in comment by Wiktor Stribiżew)
>>> import regex
>>> line = 'spam'
>>> regex.sub(r'\G\s', '&nbsp;', line)
'spam'
>>> line = ' eggs'
>>> regex.sub(r'\G\s', '&nbsp;', line)
'&nbsp;eggs'
>>> line = '  spam eggs'
>>> regex.sub(r'\G\s', '&nbsp;', line)
'&nbsp;&nbsp;spam eggs'
From documentation:
\G
A search anchor has been added. It matches at the position where each
search started/continued and can be used for contiguous matches or in
negative variable-length lookbehinds to limit how far back the
lookbehind goes

A non regex solution:
s = '  spam eggs'
s_s = s.lstrip()
print('&nbsp;'*(len(s) - len(s_s)) + s_s)
# &nbsp;&nbsp;spam eggs

Regular expression for word with specific prefix/suffix

I want to match the word only if it is surrounded by at most one wildcard character on either side, followed by a space or nothing on either side. For example, I want ring to match 'ring', ' ring', ' tring', 'ring ', ' ringt', ' ringt ', ' ring ', 'tringt ', 'tringt '
but not:
'ttring', 'ringttt', 'ttringtt'
So far I have:
[?\s\S]ring[?\s\S][?!\s]
any suggestions?

If I understand correctly, this should do:
(?:^|\s)\S?ring\S?(?:\s|$)
(?:^|\s) - this non-capturing group makes sure that the pattern is preceded by whitespace or is at the beginning of the string
\S? matches zero or one non-whitespace character
ring matches literal ring
\S? again matches zero or one non-whitespace character
(?:\s|$) - this non-capturing group makes sure the match is followed by whitespace or is at the end of the string
Example:
In [92]: l = ['ring ', ' ringt', ' ringt ', ' ring ', \
'tringt ', 'tringt ', 'ttring', 'ringttt', 'ttringtt']
In [93]: list(filter(lambda s: re.search(r'(?:^|\s)\S?ring\S?(?:\s|$)', s), l))
Out[93]: ['ring ', ' ringt', ' ringt ', ' ring ', 'tringt ', 'tringt ']
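To reuse the pattern for words other than ring, it could be built from the word with re.escape so that regex metacharacters in the word are treated literally. A minimal sketch; the function name matches_with_slack is hypothetical.

```python
import re

def matches_with_slack(word, text):
    """True if word occurs with at most one extra non-space character on each side."""
    # re.escape keeps characters such as '.' in word from acting as regex syntax
    pattern = r'(?:^|\s)\S?' + re.escape(word) + r'\S?(?:\s|$)'
    return re.search(pattern, text) is not None

for candidate in [' tringt ', 'ring', 'ttring', 'ringttt']:
    print(repr(candidate), matches_with_slack('ring', candidate))
```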

Regex should handle whitespace including newline differently

My goal is to make a regex that can handle 2 situations:
Multiple whitespace including one or more newlines in any order should become a single newline
Multiple whitespace excluding any newline should become a space
The unorderedness combined with the different cases for newline and no newline is what makes this complex.
What is the most efficient way to do this?
E.g.
' \n \n \n a' # --> '\na'
' \t \t a' # --> ' a'
' \na\n ' # --> '\na\n'
Benchmark:
s = ' \n \n \n a \t \t a \na\n '
n_times = 1000000
------------------------------------------------------
change_whitespace(s) - 5.87 s
change_whitespace_2(s) - 3.51 s
change_whitespace_3(s) - 3.93 s
n_times = 100000
------------------------------------------------------
change_whitespace(s * 100) - 27.9 s
change_whitespace_2(s * 100) - 16.8 s
change_whitespace_3(s * 100) - 19.7 s

(Assumes Python can do regex replace with callback function)
You could use some callback to see what the replacement needs to be.
Group 1 matches, replace with space.
Group 2 matches, replace with newline
(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)
(?<! \s ) # No whitespace behind
(?:
( [^\S\r\n]+ ) # (1), Non-linebreak whitespace
|
( \s+ ) # (2), At least 1 linebreak
)
(?! \s ) # No whitespace ahead

This replaces the whitespace that contains a newline with a single newline, then replaces the whitespace that doesn't contain a newline with a single space.
import re
def change_whitespace(string):
    return re.sub(r'[ \t\f\v]+', ' ', re.sub(r'[\s]*[\n\r]+[\s]*', '\n', string))
Results:
>>> change_whitespace(' \n \n \n a')
'\na'
>>> change_whitespace(' \t \t a')
' a'
>>> change_whitespace(' \na\n ')
'\na\n'
Thanks to @sln for reminding me of regex callback functions:
def change_whitespace_2(string):
    return re.sub(r'\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', string)
Results:
>>> change_whitespace_2(' \n \n \n a')
'\na'
>>> change_whitespace_2(' \t \t a')
' a'
>>> change_whitespace_2(' \na\n ')
'\na\n'
And here's a function with @sln's expression:
def change_whitespace_3(string):
    return re.sub(r'(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)', lambda x: ' ' if x.group(1) else '\n', string)
Results:
>>> change_whitespace_3(' \n \n \n a')
'\na'
>>> change_whitespace_3(' \t \t a')
' a'
>>> change_whitespace_3(' \na\n ')
'\na\n'
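The timings quoted in the question could be reproduced along these lines with timeit. This is a rough sketch: the repetition count of 10,000 is an arbitrary assumption, absolute numbers depend on the machine, and the script first checks that the three functions agree on the benchmark string.

```python
import re
import timeit

def change_whitespace(string):
    return re.sub(r'[ \t\f\v]+', ' ', re.sub(r'[\s]*[\n\r]+[\s]*', '\n', string))

def change_whitespace_2(string):
    return re.sub(r'\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', string)

def change_whitespace_3(string):
    return re.sub(r'(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)',
                  lambda x: ' ' if x.group(1) else '\n', string)

sample = ' \n \n \n a \t \t a \na\n '
expected = change_whitespace(sample)

for func in (change_whitespace, change_whitespace_2, change_whitespace_3):
    # Make sure every variant produces the same result before timing it
    assert func(sample) == expected
    seconds = timeit.timeit(lambda: func(sample), number=10_000)
    print(f'{func.__name__}: {seconds:.3f} s for 10,000 calls')
```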

Regex conditionally remove text within parens

I was wondering how to remove the following text from a string in Python using regular expressions.
string = "Hello (John)"
(magic regex)
string = "Hello "
However, I only want to remove the text in the parens if it contains the substring "John". So for example,
string = "Hello (Sally)"
(magic regex)
string = "Hello (Sally)"
Is this possible? Thanks!

This should be the gist of what you want:
>>> from re import sub
>>> mystr = "Hello (John)"
>>> sub(r"(?s)\(.*?John.*?\)", "", mystr)
'Hello '
>>> mystr = "Hello (Sally)"
>>> sub(r"(?s)\(.*?John.*?\)", "", mystr)
'Hello (Sally)'
>>> mystr = "Hello (John) My John (Sally)"
>>> sub(r"(?s)\(.*?John.*?\)", "", mystr)
'Hello  My John (Sally)'
>>>
Breakdown:
(?s) # Dot-all flag to have . match newline characters
\( # Opening parenthesis
.*? # Zero or more characters matching non-greedily
John # Target
.*? # Zero or more characters matching non-greedily
\) # Closing parenthesis
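One thing to watch with the breakdown above: .*? may run past a closing parenthesis, so a John outside the parentheses can cause an unrelated group to be removed. A possible variant, offered as a sketch and not as the answer's method, replaces .*? with [^()]* so that the match stays inside a single pair of parentheses. The helper name remove_parens_with and the sample string are hypothetical.

```python
import re

def remove_parens_with(name, text):
    """Remove each (...) group whose own contents include name."""
    # [^()]* cannot cross a parenthesis, so the match stays within one group
    pattern = r'\([^()]*' + re.escape(name) + r'[^()]*\)'
    return re.sub(pattern, '', text)

sample = 'Hello (Sally) and John (x)'
print(repr(re.sub(r'(?s)\(.*?John.*?\)', '', sample)))  # 'Hello '
print(repr(remove_parens_with('John', sample)))         # 'Hello (Sally) and John (x)'
```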

If you just want to remove all instances of (John), you can do:
string = "Hello (John)"
string = string.replace("(John)", "")
print(string) # Prints "Hello "

import re
# Capture the text between a pair of parentheses
REGEX = re.compile(r'\(([^)]+)\)')
def replace(match):
    # Drop the whole parenthesized group only when it mentions John
    if 'John' in match.group(1):
        return ''
    return '(' + match.group(1) + ')'
my_string = 'Hello (John)'
print(REGEX.sub(replace, my_string))
my_string = 'Hello (test John string)'
print(REGEX.sub(replace, my_string))
my_string = 'Hello (Sally)'
print(REGEX.sub(replace, my_string))
Hello
Hello
Hello (Sally)
