Regular expression for word with specific prefix/suffix - python

I want to match a word only if it is surrounded by at most one wildcard character on either side, which in turn is followed by a space or nothing. For example, I want ring to match 'ring', ' ring', ' tring', 'ring ', ' ringt', ' ringt ', ' ring ', 'tringt ', 'tringt '
but not:
'ttring', 'ringttt', 'ttringtt'
So far I have:
[?\s\S]ring[?\s\S][?!\s]
Any suggestions?

If I understand correctly, this should do:
(?:^|\s)\S?ring\S?(?:\s|$)
(?:^|\s) - this non-capturing group makes sure that the pattern is preceded by whitespace or sits at the beginning of the string
\S? matches zero or one non-whitespace character (used on both sides of the word)
ring matches literal ring
(?:\s|$) - this non-capturing group makes sure the match is followed by whitespace or is at the end of the string
Example:
In [92]: l = ['ring ', ' ringt', ' ringt ', ' ring ', \
'tringt ', 'tringt ', 'ttring', 'ringttt', 'ttringtt']
In [93]: list(filter(lambda s: re.search(r'(?:^|\s)\S?ring\S?(?:\s|$)', s), l))
Out[93]: ['ring ', ' ringt', ' ringt ', ' ring ', 'tringt ', 'tringt ']
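A minimal sketch that wraps the pattern in a helper and checks it against the examples from the question. The helper name `matches_word` and the use of `re.escape` (so the word can contain regex metacharacters) are assumptions for illustration, not part of the original answer.

```python
import re

def matches_word(word, text):
    """Return True if `word` occurs with at most one extra non-whitespace
    character on each side, bounded by whitespace or the string edges."""
    pattern = r'(?:^|\s)\S?' + re.escape(word) + r'\S?(?:\s|$)'
    return re.search(pattern, text) is not None

should_match = ['ring', ' ring', ' tring', 'ring ', ' ringt',
                ' ringt ', ' ring ', 'tringt ']
should_not_match = ['ttring', 'ringttt', 'ttringtt']

print(all(matches_word('ring', s) for s in should_match))
print(any(matches_word('ring', s) for s in should_not_match))
```

The first line printed covers the strings that should match, the second the ones that should be rejected.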

Related

empty space after stopwords removal and lemmatisation

The text looks like this before processing
0 [It's, good, for, beginners] positive
1 [I, recommend, this, starter, Ukulele, kit., I... positive
After preprocessing with stopword removal and lemmatisation
nlp = spacy.load('en', disable=['ner', 'parser']) # disabling Named Entity Recognition for speed
def cleaning(doc):
    txt = [token.lemma_ for token in doc if not token.is_stop]
    if len(txt) > 2:
        return ' '.join(txt)
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df3['reviewText'])
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000, n_threads=-1)]
the result came like this
0 ' good ' ' ' ' beginner ' positive
1 ' ' ' recommend ' ' ' ' starter ' ' ukulele ... positive
As you can see, there are lots of ' ' in the result. What caused this? I'm assuming it's the return ' '.join(txt) and re.sub("[^A-Za-z']+", ' ', ...) that caused it, but if I remove the space or use return (txt), it simply won't remove any stopwords or carry out lemmatisation.
Will these empty strings cause trouble, or are they necessary? I'm doing bigrams and word2vec afterwards.
How can I fix it and have the result returned as ' recommend ' ' starter ' ' ukulele ' ' kit ' ' need ' ' learn ' ' ukulele '?

Python: match and replace all whitespaces at the beginning of each line

I need to convert text like this:
' 1 white space before string'
'  2 white spaces before string'
'   3 white spaces before string'
into this:
'&nbsp;1 white space before string'
'&nbsp;&nbsp;2 white spaces before string'
'&nbsp;&nbsp;&nbsp;3 white spaces before string'
Whitespaces between words and at the end of the line should not be matched, only at the beginning. Also, no need to match tabs. Big thx for help
Use re.sub with a callback that performs the actual replacement:
import re
list_of_strings = [...]
p = re.compile('^ +')
for i, l in enumerate(list_of_strings):
    list_of_strings[i] = p.sub(lambda x: x.group().replace(' ', '&nbsp;'), l)
print(list_of_strings)
['&nbsp;1 white space before string',
 '&nbsp;&nbsp;2 white spaces before string',
 '&nbsp;&nbsp;&nbsp;3 white spaces before string'
]
The pattern used here is '^ +', which matches one or more spaces only at the start of the string, so spaces between words and at the end of the line are left alone.
If you know it's just spaces as leading whitespace, you could do something like this:
l = '&nbsp;' * (len(l) - len(l.lstrip())) + l.lstrip()
Not the most efficient though, since it calls lstrip() twice. This would be a bit better:
stripped = l.lstrip()
l = '&nbsp;' * (len(l) - len(stripped)) + stripped
print(l)
It's one way to do it without the re overhead.
For example:
lines = [
    ' 1 white space before string',
    '  2 white spaces before string',
    '   3 white spaces before string',
]
for l in lines:
    stripped = l.lstrip()
    l = '&nbsp;' * (len(l) - len(stripped)) + stripped
    print(l)
Output:
&nbsp;1 white space before string
&nbsp;&nbsp;2 white spaces before string
&nbsp;&nbsp;&nbsp;3 white spaces before string
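As a rough sketch, both approaches can be written as small functions and compared on a line that also has inner and trailing spaces. The function names and the `replacement` parameter are hypothetical; `&nbsp;` is used here only as an example of a visible replacement string.

```python
import re

LEADING_SPACES = re.compile(r'^ +')

def replace_leading_spaces(line, replacement):
    """Regex version: swap each leading space for `replacement`."""
    return LEADING_SPACES.sub(lambda m: replacement * len(m.group()), line)

def replace_leading_spaces_no_re(line, replacement):
    """String-method version: count the leading spaces, then rebuild the line."""
    stripped = line.lstrip(' ')  # strip spaces only, so leading tabs are kept
    return replacement * (len(line) - len(stripped)) + stripped

line = '  2 white spaces before string  '
print(repr(replace_leading_spaces(line, '&nbsp;')))
print(repr(replace_leading_spaces_no_re(line, '&nbsp;')))
```

Both calls leave the spaces between words and the two trailing spaces untouched.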

how to remove spaces in a list that has a specific character

How do I fix the spaces in the strings of the list m?
m = ['m, a \n', 'l, n \n', 'c, l\n']
for i in m:
    if (' ') in i:
        i.strip(' ')
I got:
'm, a \n'
'l, n \n'
'c, l\n'
and I want it to return:
['m, a\n', 'l, n\n', 'c, l\n']
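The strip() method removes the given characters from both ends of the string, stopping as soon as it meets a character that is not in the set. In your case, strip(' ') starts at the end of your string, encounters a '\n' character, and stops, so the space before the newline is never removed.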
It seems a little unclear what you are trying to do, but I will assume that you are looking to clear out any white space between the last non-whitespace character of your string and the newline. Correct me if I'm wrong.
There are many ways to do this, and this may not be the best, but here is what I came up with:
m = ['This, is a string. \n', 'another string! \n', 'final example\n ']
m = list(map(lambda x: x.rstrip() + '\n' if x[-1] == '\n' else x.rstrip(' '), m))
print(m)
['This, is a string.\n', 'another string!\n', 'final example\n']
Here I use the built-in map function to iterate over each list element and remove all whitespace from the end of the string (rstrip() instead of strip(), which strips both the start and the end), adding a newline back if the original string ended with one.
Your code wouldn't be useful in a script; you are just seeing the REPL displaying the result of the expression i.strip(' '). In a script, that value would just be ignored.
To create a list, use a list comprehension:
result = [i.strip(' ') for i in m if ' ' in i]
Note, however, strip only removes the requested character from either end; in your data, the space precedes the newline. You'll need to do something like removing the newline as well, then put it back:
result = ["%s\n" % i.strip() for i in m if ' ' in i]
You can use regex:
import re
m = ['m, a \n', 'l, n \n', 'c, l\n']
final_m = [re.sub(r'(?<=[a-zA-Z])\s+(?=\n)', '', i) for i in m]
Output:
['m, a\n', 'l, n\n', 'c, l\n']
Quick and dirty, if you know that only one space goes before the '\n':
m = [x.replace(' \n', '\n') for x in m]
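Since strings are immutable, every approach above has to build new strings and store them. A minimal sketch that handles any number of spaces or tabs before a trailing newline, without regex; the helper name `strip_before_newline` is made up for this example.

```python
def strip_before_newline(s):
    """Remove spaces and tabs that sit directly before a trailing newline."""
    if s.endswith('\n'):
        return s[:-1].rstrip(' \t') + '\n'
    return s

m = ['m, a \n', 'l, n \n', 'c, l\n']
m = [strip_before_newline(s) for s in m]
print(m)  # ['m, a\n', 'l, n\n', 'c, l\n']
```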

How to reduce whitespace in Python? [duplicate]

How do I reduce whitespace in Python from
test = '   Good   ' to a single space on each side, test = ' Good '?
I tried defining this function, but when I call test = reducing_white(test) it doesn't work at all. Does it have to do with the function's return or something?
counter = []
def reducing_white(txt):
    counter = txt.count(' ')
    while counter > 2:
        txt = txt.replace(' ','',1)
        counter = txt.count(' ')
    return txt
Here is how I solved it:
def reduce_ws(txt):
    ntxt = txt.strip()
    return ' '+ ntxt + ' '
j = ' Hello World '
print(reduce_ws(j))
OUTPUT:
' Hello World '
You need to use regular expressions:
>>> import re
>>> re.sub(r'\s+', ' ', test)
' Good '
>>> test = '   Good   Sh  ow  '
>>> re.sub(r'\s+', ' ', test)
' Good Sh ow '
r'\s+' matches each run of one or more whitespace characters, and re.sub replaces the entire run with ' ', i.e. a single space.
This solution handles any mix of spaces, tabs, and newlines, not just repeated spaces.
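For comparison, a short sketch of the regex approach next to the common `split`/`join` idiom. The two differ at the edges of the string: `re.sub` keeps one space for a leading or trailing run, while `split`/`join` drops it. The sample string is an assumption chosen to show that difference.

```python
import re

test = '   Good   Sh  ow  '

# Every run of whitespace collapses to one space, including the runs at both ends.
collapsed = re.sub(r'\s+', ' ', test)

# split() with no arguments discards leading and trailing whitespace entirely.
trimmed = ' '.join(test.split())

print(repr(collapsed))  # ' Good Sh ow '
print(repr(trimmed))    # 'Good Sh ow'
```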

Regex should handle whitespace including newline differently

My goal is to make a regex that can handle 2 situations:
Multiple whitespace including one or more newlines in any order should become a single newline
Multiple whitespace excluding any newline should become a space
The unorderedness combined with the different cases for newline and no newline is what makes this complex.
What is the most efficient way to do this?
E.g.
' \n \n \n a' # --> '\na'
' \t \t a' # --> ' a'
' \na\n ' # --> '\na\n'
Benchmark:
s = ' \n \n \n a \t \t a \na\n '
n_times = 1000000
------------------------------------------------------
change_whitespace(s) - 5.87 s
change_whitespace_2(s) - 3.51 s
change_whitespace_3(s) - 3.93 s
n_times = 100000
------------------------------------------------------
change_whitespace(s * 100) - 27.9 s
change_whitespace_2(s * 100) - 16.8 s
change_whitespace_3(s * 100) - 19.7 s
(Assumes Python can do regex replace with callback function)
You could use a callback to decide what the replacement needs to be:
if group 1 matches, replace with a space;
if group 2 matches, replace with a newline.
(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)
(?<! \s ) # No whitespace behind
(?:
( [^\S\r\n]+ ) # (1), Non-linebreak whitespace
|
( \s+ ) # (2), At least 1 linebreak
)
(?! \s ) # No whitespace ahead
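In Python, the expanded layout above can be kept as-is by compiling with `re.VERBOSE`, which ignores whitespace and `#` comments inside the pattern. A minimal sketch, assuming the names `WS_RUN` and `normalize_whitespace` (both invented for this example):

```python
import re

WS_RUN = re.compile(r"""
    (?<! \s )            # no whitespace behind
    (?:
        ( [^\S\r\n]+ )   # (1) whitespace run without any line break
      |
        ( \s+ )          # (2) whitespace run with at least one line break
    )
    (?! \s )             # no whitespace ahead
""", re.VERBOSE)

def normalize_whitespace(text):
    # Group 1 set: the run had no line break, so it becomes a space;
    # otherwise group 2 matched and the run becomes a single newline.
    return WS_RUN.sub(lambda m: ' ' if m.group(1) else '\n', text)

print(repr(normalize_whitespace(' \n \n \n a')))
print(repr(normalize_whitespace(' \t \t a')))
```

The lookarounds force each match to cover a whole whitespace run, which is why group 2 only ever fires when group 1 could not consume the entire run.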
This replaces the whitespace that contains a newline with a single newline, then replaces the whitespace that doesn't contain a newline with a single space.
import re
def change_whitespace(string):
    return re.sub(r'[ \t\f\v]+', ' ', re.sub(r'[\s]*[\n\r]+[\s]*', '\n', string))
Results:
>>> change_whitespace(' \n \n \n a')
'\na'
>>> change_whitespace(' \t \t a')
' a'
>>> change_whitespace(' \na\n ')
'\na\n'
Thanks to @sln for reminding me of regex callback functions:
def change_whitespace_2(string):
    return re.sub(r'\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', string)
Results:
>>> change_whitespace_2(' \n \n \n a')
'\na'
>>> change_whitespace_2(' \t \t a')
' a'
>>> change_whitespace_2(' \na\n ')
'\na\n'
And here's a function with @sln's expression:
def change_whitespace_3(string):
    return re.sub(r'(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)', lambda x: ' ' if x.group(1) else '\n', string)
Results:
>>> change_whitespace_3(' \n \n \n a')
'\na'
>>> change_whitespace_3(' \t \t a')
' a'
>>> change_whitespace_3(' \na\n ')
'\na\n'
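The benchmark figures in the question come without the timing code. A minimal sketch of how such a comparison might be run with `timeit`, assuming the three functions as defined above; the repeat count is deliberately small, and the timings will differ from machine to machine.

```python
import re
import timeit

def change_whitespace(string):
    return re.sub(r'[ \t\f\v]+', ' ', re.sub(r'\s*[\n\r]+\s*', '\n', string))

def change_whitespace_2(string):
    return re.sub(r'\s+', lambda m: '\n' if '\n' in m.group(0) else ' ', string)

def change_whitespace_3(string):
    return re.sub(r'(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)',
                  lambda m: ' ' if m.group(1) else '\n', string)

s = ' \n \n \n a \t \t a \na\n '
candidates = [change_whitespace, change_whitespace_2, change_whitespace_3]

# Check that all three agree on the sample before comparing their speed.
results = {f.__name__: f(s) for f in candidates}
assert len(set(results.values())) == 1

for f in candidates:
    seconds = timeit.timeit(lambda: f(s), number=10_000)
    print(f'{f.__name__}: {seconds:.3f} s')
```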
