Regex should handle whitespace including newline differently - python

My goal is to make a regex that can handle 2 situations:
Multiple whitespace including one or more newlines in any order should become a single newline
Multiple whitespace excluding any newline should become a space
The whitespace can appear in any order, and the newline and no-newline cases need different replacements, which is what makes this tricky.
What is the most efficient way to do this?
E.g.
' \n \n \n a' # --> '\na'
' \t \t a' # --> ' a'
' \na\n ' # --> '\na\n'
Benchmark:
s = ' \n \n \n a \t \t a \na\n '
n_times = 1000000
------------------------------------------------------
change_whitespace(s) - 5.87 s
change_whitespace_2(s) - 3.51 s
change_whitespace_3(s) - 3.93 s
n_times = 100000
------------------------------------------------------
change_whitespace(s * 100) - 27.9 s
change_whitespace_2(s * 100) - 16.8 s
change_whitespace_3(s * 100) - 19.7 s

(Assumes Python can do regex replace with callback function)
You could use some callback to see what the replacement needs to be.
Group 1 matches, replace with space.
Group 2 matches, replace with newline
(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)
(?<! \s ) # No whitespace behind
(?:
( [^\S\r\n]+ ) # (1), Non-linebreak whitespace
|
( \s+ ) # (2), At least 1 linebreak
)
(?! \s ) # No whitespace ahead
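The commented layout above is the free-spacing form of the pattern. As a minimal sketch, assuming Python's re.VERBOSE flag (which ignores layout whitespace and # comments inside the pattern), it could be compiled more or less as written and paired with a callback. The names WHITESPACE_RUN and collapse_whitespace are made up for illustration.

```python
import re

# Free-spacing version of the pattern; re.VERBOSE ignores the layout whitespace
# and the comments inside the pattern.
WHITESPACE_RUN = re.compile(r"""
    (?<! \s )              # no whitespace behind
    (?:
        ( [^\S\r\n]+ )     # (1) run of whitespace with no linebreak
      |
        ( \s+ )            # (2) run of whitespace with at least one linebreak
    )
    (?! \s )               # no whitespace ahead
""", re.VERBOSE)


def collapse_whitespace(text):
    # Hypothetical helper name: group 1 matched means the run had no linebreak.
    return WHITESPACE_RUN.sub(lambda m: ' ' if m.group(1) else '\n', text)


print(repr(collapse_whitespace(' \n \n \n a')))   # '\na'
print(repr(collapse_whitespace(' \t \t a')))      # ' a'
```

The lookarounds force each match to cover a whole run of whitespace, so group 1 can only succeed when the entire run is free of linebreaks.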

This replaces the whitespace that contains a newline with a single newline, then replaces the whitespace that doesn't contain a newline with a single space.
import re
def change_whitespace(string):
    return re.sub(r'[ \t\f\v]+', ' ', re.sub(r'\s*[\n\r]+\s*', '\n', string))
Results:
>>> change_whitespace(' \n \n \n a')
'\na'
>>> change_whitespace(' \t \t a')
' a'
>>> change_whitespace(' \na\n ')
'\na\n'
Thanks to @sln for reminding me of regex callback functions:
def change_whitespace_2(string):
    return re.sub(r'\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', string)
Results:
>>> change_whitespace_2(' \n \n \n a')
'\na'
>>> change_whitespace_2(' \t \t a')
' a'
>>> change_whitespace_2(' \na\n ')
'\na\n'
And here's a function with @sln's expression:
def change_whitespace_3(string):
    return re.sub(r'(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)', lambda x: ' ' if x.group(1) else '\n', string)
Results:
>>> change_whitespace_3(' \n \n \n a')
'\na'
>>> change_whitespace_3(' \t \t a')
' a'
>>> change_whitespace_3(' \na\n ')
'\na\n'
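For anyone wanting to reproduce the comparison, here is a rough sketch of how the three functions could be timed with the standard timeit module. The benchmark helper and the reduced n_times are assumptions for illustration, not the setup used for the numbers in the question, and timings will differ by machine.

```python
import re
import timeit


def change_whitespace(string):
    return re.sub(r'[ \t\f\v]+', ' ', re.sub(r'\s*[\n\r]+\s*', '\n', string))


def change_whitespace_2(string):
    return re.sub(r'\s+', lambda m: '\n' if '\n' in m.group(0) else ' ', string)


def change_whitespace_3(string):
    return re.sub(r'(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)',
                  lambda m: ' ' if m.group(1) else '\n', string)


def benchmark(sample, n_times=10000):
    # Hypothetical helper: maps each function name to the seconds taken
    # for n_times calls on the sample string.
    funcs = (change_whitespace, change_whitespace_2, change_whitespace_3)
    return {f.__name__: timeit.timeit(lambda: f(sample), number=n_times)
            for f in funcs}


sample = ' \n \n \n a \t \t a \na\n '
for name, seconds in benchmark(sample).items():
    print(f'{name}(s) - {seconds:.3f} s')
```

Note that change_whitespace_2 only checks for '\n', so it treats a run containing just '\r' as plain whitespace, unlike the other two.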

Related

How to Insert space between a special character and everything else

I have some text for latex that I am working on and I need to clean it in order to split it properly based on spacing.
So the string:
\\mathrm l >\\mathrm li ^ + >\\mathrm mg ^ +>\\mathrm a \\beta+ \\mathrm co
should be:
\\mathrm l > \\mathrm li ^ + > \\mathrm mg ^ + > \\mathrm a \\beta + \\mathrm co
So in order for me to split it, I have to create spacing between every character if it is a special character. Also I want to keep the latex notation intact as \something.
I can use re.compile([a-zA-Z0-9 \\]) to identify the special characters, but how should I go about inserting the spaces?
I have written something like this, but it does not look efficient (or is it?):
def insert_space(sentence):
    '''
    Add a space around special characters, so "x+y +-=y \\latex" becomes: "x + y + - = y \\latex"
    '''
    string = ''
    for i in sentence:
        if (not i.isalnum()) and i not in [' ','\\']:
            string += ' '+i+' '
        else:
            string += i
    return re.sub(r'\s+', ' ',string)
I haven't used LaTeX so if you're sure that [a-zA-Z0-9 \\] captures everything that isn't a special character you could do something like this.
import re
def insert_space(sentence):
    sentence = re.sub(r'(?<! )(?![a-zA-Z0-9 \\])', ' ', sentence)
    sentence = re.sub(r'(?<!^)(?<![a-zA-Z0-9 \\])(?! )', ' ', sentence)
    return sentence
my_string = '\\mathrm l >\\mathrm li ^ + >\\mathrm mg ^ +>\\mathrm a \\beta+ \\mathrm co'
print('before', my_string)
# before \mathrm l >\mathrm li ^ + >\mathrm mg ^ +>\mathrm a \beta+ \mathrm co
print('after', insert_space(my_string))
# after \mathrm l > \mathrm li ^ + > \mathrm mg ^ + > \mathrm a \beta + \mathrm co
The first regex is:
(?<! ) Negative look behind for a space.
(?![a-zA-Z0-9 \\]) Negative look ahead for the character class you specified.
Replace all of these occurrences with a space ' '.
The second regex is:
(?<!^) Negative look behind for the start of the string.
(?<![a-zA-Z0-9 \\]) Negative look behind for the character class you specified.
(?! ) Negative look ahead for a space.
Replace all of these occurrences with a space ' '.
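So effectively, the first substitution finds each position just before a special character that is not already preceded by a space and inserts a space there; the second does the same for positions just after a special character that is not already followed by a space.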
The reason you need to also include (?<!^) is to ignore the position between the start of the string and the first character. Otherwise it will include an extra space at the beginning.
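Since the end goal is to split the text, another option is to tokenize directly instead of inserting spaces. This is a different technique from the lookaround substitutions above, sketched under the same assumption that [a-zA-Z0-9 \\] covers everything that is not a special character; TOKEN and insert_space_findall are hypothetical names.

```python
import re

# Runs of "ordinary" characters (letters, digits, backslashes) stay together;
# any other non-whitespace character becomes a token of its own.
TOKEN = re.compile(r'[a-zA-Z0-9\\]+|[^a-zA-Z0-9\\\s]')


def insert_space_findall(sentence):
    # Hypothetical alternative to insert_space: tokenize, then re-join.
    return ' '.join(TOKEN.findall(sentence))


my_string = '\\mathrm l >\\mathrm li ^ + >\\mathrm mg ^ +>\\mathrm a \\beta+ \\mathrm co'
print(insert_space_findall(my_string))
print(TOKEN.findall('x+y +-=y \\latex'))
```

Unlike the substitution approach, this also collapses existing whitespace and drops leading and trailing whitespace, which may or may not be what you want.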

re split to break a string into components but keeping separators

I want to break a string into components
s = 'Hello [foo] world!'
re.split(r'\[(.*?)\]', s)
which gives me
['Hello ', 'foo', ' world!']
But I want to achieve
['Hello ', '[foo]', ' world!']
Please help!
Use
import re
s = 'Hello [foo] world!'
print(re.split(r'(\[[^][]*])', s))
See Python proof.
Results: ['Hello ', '[foo]', ' world!']
Explanation
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\[ '['
--------------------------------------------------------------------------------
[^][]* any character except: ']', '[' (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
] ']'
--------------------------------------------------------------------------------
) end of \1
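The key point is the capturing group: re.split returns the text matched by any capturing group alongside the pieces. A small sketch of the difference, using the same pattern with and without the outer parentheses:

```python
import re

s = 'Hello [foo] world!'

# Without a capturing group, the separators are dropped.
print(re.split(r'\[[^][]*]', s))      # ['Hello ', ' world!']

# With the whole separator captured, it is returned between the pieces.
print(re.split(r'(\[[^][]*])', s))    # ['Hello ', '[foo]', ' world!']

# A separator at the very end leaves a trailing empty string.
parts = re.split(r'(\[[^][]*])', 'a[b]c[d]')
print(parts)                          # ['a', '[b]', 'c', '[d]', '']
```

If the empty strings are unwanted, they can be filtered out afterwards, for example with [p for p in parts if p].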

Python: match and replace all whitespaces at the beginning of each line

I need to convert text like this:
' 1 white space before string'
' 2 white spaces before string'
' 3 white spaces before string'
Into a:
' 1 white space before string'
' 2 white spaces before string'
' 3 white spaces before string'
Whitespaces between words and at the end of the line should not be matched, only at the beginning. Also, no need to match tabs. Big thx for help
Use re.sub with a callback that performs the actual replacement:
import re
list_of_strings = [...]
p = re.compile('^ +')
for i, l in enumerate(list_of_strings):
    list_of_strings[i] = p.sub(lambda x: x.group().replace(' ', ' '), l)
print(list_of_strings)
[' 1 white space before string',
' 2 white spaces before string',
' 3 white spaces before string'
]
The pattern used here is '^ +' and will search for, and replace whitespaces as long as they're at the start of your string.
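If the text is one multi-line string rather than a list of lines, the same idea should work with the re.MULTILINE flag, which makes ^ match at the start of every line. A minimal sketch; the helper name and the '_' placeholder are assumptions chosen so the result is visible:

```python
import re

# With re.MULTILINE, ^ matches at the start of each line, not just the string.
LEADING_SPACES = re.compile(r'^ +', re.MULTILINE)


def replace_leading_spaces(text, replacement):
    # `replacement` is whatever the caller wants each leading space to become
    # (an assumption for illustration).
    return LEADING_SPACES.sub(lambda m: replacement * len(m.group()), text)


text = ' one\n  two  words \nthree'
print(replace_leading_spaces(text, '_'))
```

Spaces between words and at the end of a line are left alone because the pattern is anchored at the start of the line.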
If you know it's just spaces as leading whitespace, you could do something like this:
l = ' ' * (len(l) - len(l.lstrip())) + l.lstrip()
Not the most efficient though. This would be a bit better, since it calls lstrip() only once:
stripped = l.lstrip()
l = ' ' * (len(l) - len(stripped)) + stripped
print(l)
It's one way to do it without the re overhead.
For example:
lines = [
' 1 white space before string',
' 2 white spaces before string',
' 3 white spaces before string',
]
for l in lines:
    stripped = l.lstrip()
    l = ' ' * (len(l) - len(stripped)) + stripped
    print(l)
Output:
1 white space before string
2 white spaces before string
3 white spaces before string

Regular expression for word with specific prefix/suffix

I want to match the word only if it is surrounded by at most one wildcard character on either side, followed by a space or nothing on either side. For example, I want ring to match 'ring', ' ring', ' tring', 'ring ', ' ringt', ' ringt ', ' ring ', 'tringt ', 'tringt '
but not:
'ttring', 'ringttt', 'ttringtt'
so far I have:
[?\s\S]ring[?\s\S][?!\s]
any suggestions?
If I understand correctly, this should do:
(?:^|\s)\S?ring\S?(?:\s|$)
(?:^|\s) - this non-capturing group makes sure that the pattern is preceded by a whitespace or at the beginning
\S? matches zero or one non-whitespace character
ring matches literal ring
(?:\s|$) - this non-capturing group makes sure the pattern is followed by whitespace or is at the end of the string
Example:
In [92]: l = ['ring ', ' ringt', ' ringt ', ' ring ', \
'tringt ', 'tringt ', 'ttring', 'ringttt', 'ttringtt']
In [93]: list(filter(lambda s: re.search(r'(?:^|\s)\S?ring\S?(?:\s|$)', s), l))
Out[93]: ['ring ', ' ringt', ' ringt ', ' ring ', 'tringt ', 'tringt ']

How do I trim whitespace from a string?

How do I remove leading and trailing whitespace from a string in Python?
" Hello world " --> "Hello world"
" Hello world" --> "Hello world"
"Hello world " --> "Hello world"
"Hello world" --> "Hello world"
To remove all whitespace surrounding a string, use .strip(). Examples:
>>> ' Hello '.strip()
'Hello'
>>> ' Hello'.strip()
'Hello'
>>> 'Bob has a cat'.strip()
'Bob has a cat'
>>> ' Hello '.strip() # ALL consecutive spaces at both ends removed
'Hello'
Note that str.strip() removes all whitespace characters, including tabs and newlines. To remove only spaces, specify the specific character to remove as an argument to strip:
>>> " Hello\n ".strip(" ")
'Hello\n'
To remove only one space at most:
def strip_one_space(s):
    if s.endswith(" "): s = s[:-1]
    if s.startswith(" "): s = s[1:]
    return s
>>> strip_one_space("  Hello ")
' Hello'
As pointed out in answers above
my_string.strip()
will remove all the leading and trailing whitespace characters such as \n, \r, \t, \f and space.
For more flexibility use the following
Removes only leading whitespace chars: my_string.lstrip()
Removes only trailing whitespace chars: my_string.rstrip()
Removes specific whitespace chars: my_string.strip('\n') or my_string.lstrip('\n\r') or my_string.rstrip('\n\t') and so on.
More details are available in the docs.
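A short sketch of how these variants behave on one sample string (the sample itself is made up for illustration):

```python
s = '\n\t Hello world \r\n'

print(repr(s.strip()))       # 'Hello world'
print(repr(s.lstrip()))      # 'Hello world \r\n'
print(repr(s.rstrip()))      # '\n\t Hello world'
print(repr(s.strip('\n')))   # '\t Hello world \r'

# The argument is a set of characters, not a prefix or suffix.
print('-.,title,.-'.strip(',.-'))   # title
```

Stripping stops at the first character that is not in the set, which is why strip('\n') leaves the tab and the carriage return in place.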
strip is not limited to whitespace characters either:
# remove all leading/trailing commas, periods and hyphens
title = title.strip(',.-')
This will remove all leading and trailing whitespace in myString:
myString.strip()
You want strip():
myphrases = [" Hello ", " Hello", "Hello ", "Bob has a cat"]
for phrase in myphrases:
    print(phrase.strip())
This can also be done with a regular expression
import re
input = " Hello "
output = re.sub(r'^\s+|\s+$', '', input)
# output = 'Hello'
As a beginner, I found this thread confusing, so I came up with a simple shortcut.
Although str.strip() removes leading and trailing spaces, it does nothing for spaces between characters.
words=input("Enter the word to test")
# If the user enters text with stray spaces inside it, that becomes a problem
# input = " he llo, ho w are y ou "
n=words.strip()
print(n)
# output "he llo, ho w are y ou" - only leading & trailing spaces are removed
Instead, use str.replace(), which is simpler, less error-prone and more to the point.
The following code generalizes the use of str.replace():
def whitespace(words):
    r=words.replace(' ','') # removes all spaces (not tabs or newlines)
    n=r.replace(',','|') # other uses of replace
    return n
def run():
    words=input("Enter the word to test") # take user input
    m=whitespace(words) # wrap the call in run() so it can be reused by other functions
    o=m.count('f') # for testing
    return m,o
print(run())
output- ('hello|howareyou', 0)
This can be helpful when reusing the same logic in different functions.
Stray whitespace also causes plenty of indentation errors when running your finished code or programs in Python. If Python keeps reporting an indentation error on line 1, 2, 3, 4, 5, etc., go back and fix the indentation of that line.
However, if you still get errors related to typing mistakes, operators, etc., make sure you read the error message to see why Python is complaining:
The first thing to check is that you have your indentation right. If you do, then check to see if you have mixed tabs with spaces in your code.
Remember: the code will look fine (to you), but the interpreter refuses to run it. If you suspect this, a quick fix is to bring your code into an IDLE edit window, then choose Edit... "Select All" from the menu system, before choosing Format... "Untabify Region".
If you've mixed tabs with spaces, this will convert all your tabs to spaces in one go (and fix any indentation issues).
I could not find a solution to what I was looking for so I created some custom functions. You can try them out.
def cleansed(s: str):
    """:param s: String to be cleansed"""
    assert s  # reject None and the empty string
    # return trimmed(s.replace('"', '').replace("'", ""))
    return trimmed(s)
def trimmed(s: str):
    """:param s: String to be cleansed"""
    assert s  # reject None and the empty string
    ss = trim_start_and_end(s).replace('  ', ' ')
    while '  ' in ss:  # keep collapsing double spaces until none remain
        ss = ss.replace('  ', ' ')
    return ss
def trim_start_and_end(s: str):
    """:param s: String to be cleansed"""
    assert s  # reject None and the empty string
    return trim_start(trim_end(s))
def trim_start(s: str):
    """:param s: String to be cleansed"""
    assert s  # reject None and the empty string
    chars = []
    for c in s:
        # skip spaces until the first non-space character has been seen
        if c != ' ' or len(chars) > 0:
            chars.append(c)
    return "".join(chars).lower()
def trim_end(s: str):
    """:param s: String to be cleansed"""
    assert s  # reject None and the empty string
    chars = []
    for c in reversed(s):
        # same as trim_start, but walking the string from the end
        if c != ' ' or len(chars) > 0:
            chars.append(c)
    return "".join(reversed(chars)).lower()
s1 = ' b Beer '
s2 = 'Beer b '
s3 = ' Beer b '
s4 = ' bread butter Beer b '
cdd = trim_start(s1)
cddd = trim_end(s2)
clean1 = cleansed(s3)
clean2 = cleansed(s4)
print("\nStr: {0} Len: {1} Cleansed: {2} Len: {3}".format(s1, len(s1), cdd, len(cdd)))
print("\nStr: {0} Len: {1} Cleansed: {2} Len: {3}".format(s2, len(s2), cddd, len(cddd)))
print("\nStr: {0} Len: {1} Cleansed: {2} Len: {3}".format(s3, len(s3), clean1, len(clean1)))
print("\nStr: {0} Len: {1} Cleansed: {2} Len: {3}".format(s4, len(s4), clean2, len(clean2)))
If you want to trim specified number of spaces from left and right, you could do this:
def remove_outer_spaces(text, num_of_leading, num_of_trailing):
    text = list(text)
    for i in range(num_of_leading):
        if text[i] == " ":
            text[i] = ""
        else:
            break
    for i in range(1, num_of_trailing+1):
        if text[-i] == " ":
            text[-i] = ""
        else:
            break
    return ''.join(text)
txt1 = " MY name is "
print(remove_outer_spaces(txt1, 1, 1)) # result is: " MY name is "
print(remove_outer_spaces(txt1, 2, 3)) # result is: " MY name is "
print(remove_outer_spaces(txt1, 6, 8)) # result is: "MY name is"
How do I remove leading and trailing whitespace from a string in Python?
The solution below removes leading and trailing whitespace as well as extra whitespace in between, which is useful if you need clean string values without repeated whitespace.
>>> str_1 = ' Hello World'
>>> print(' '.join(str_1.split()))
Hello World
>>>
>>>
>>> str_2 = ' Hello World'
>>> print(' '.join(str_2.split()))
Hello World
>>>
>>>
>>> str_3 = 'Hello World '
>>> print(' '.join(str_3.split()))
Hello World
>>>
>>>
>>> str_4 = 'Hello World '
>>> print(' '.join(str_4.split()))
Hello World
>>>
>>>
>>> str_5 = ' Hello World '
>>> print(' '.join(str_5.split()))
Hello World
>>>
>>>
>>> str_6 = ' Hello World '
>>> print(' '.join(str_6.split()))
Hello World
>>>
>>>
>>> str_7 = 'Hello World'
>>> print(' '.join(str_7.split()))
Hello World
As you can see, this removes all the repeated whitespace in the string (the output is Hello World in every case), wherever it occurs. But if you only need to remove leading and trailing whitespace, then strip() would be fine.
One way is to use the .strip() method (removing all surrounding whitespace):
str = " Hello World "
str = str.strip()
# result: str = "Hello World"
Note that .strip() returns a copy of the string and doesn't change the underlying object (since strings are immutable).
Should you wish to remove all spaces (not only trimming the edges):
str = ' abcd efgh ijk '
str = str.replace(' ', '')
# result: str = 'abcdefghijk'
I wanted to remove the extra spaces in a string (including those inside the string, not only at the beginning or end). I wrote this because I didn't know how else to do it:
string = "Name : David Account: 1234 Another thing: something "
ready = False
while ready == False:
    pos = string.find("  ")
    if pos != -1:
        string = string.replace("  "," ")
    else:
        ready = True
print(string)
This replaces double spaces with a single space until no double spaces remain.
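For comparison, a single regular expression substitution should give the same result in one pass. This is a sketch, with hypothetical function names and a made-up sample string; the loop version is included only to check that the two agree.

```python
import re


def collapse_spaces_loop(text):
    # Loop approach: replace pairs of spaces until none remain.
    while '  ' in text:
        text = text.replace('  ', ' ')
    return text


def collapse_spaces_regex(text):
    # One pass: any run of two or more spaces becomes a single space.
    return re.sub(r' {2,}', ' ', text)


sample = 'Name  :   David     Account:  1234 '
print(repr(collapse_spaces_regex(sample)))
```

Both versions only touch space characters; tabs and newlines are left as they are, unlike ' '.join(text.split()).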
