Python: Best practice for dynamically constructing regex [duplicate]

This question already has answers here:
Escaping regex string
(4 answers)
Closed 9 months ago.
I have a simple function to remove a "word" from some text:
import re

def remove_word_from(word, text):
    if not text or not word:
        return text
    rec = re.compile(r'(^|\s)(' + word + r')($|\s)', re.IGNORECASE)
    return rec.sub(r'\1\3', text, 1)
The problem, of course, is that if word contains characters such as "(" or ")" things break, and it generally seems unsafe to stick a random word in the middle of a regex.
What's best practice for handling cases like this? Is there a convenient, secure function I can call to escape "word" so it's safe to use?

You can use re.escape(word) to escape the word.
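As a minimal sketch of what that looks like in the function from the question (the function name and pattern are the asker's; only the re.escape call is new):

```python
import re

def remove_word_from(word, text):
    """Remove the first whole-word occurrence of `word` from `text`, ignoring case."""
    if not text or not word:
        return text
    # re.escape backslash-escapes the regex metacharacters in `word`,
    # so characters such as "(" or "." are matched literally.
    pattern = re.compile(r'(^|\s)(' + re.escape(word) + r')($|\s)', re.IGNORECASE)
    return pattern.sub(r'\1\3', text, count=1)

print(remove_word_from("(hello)", "say (hello) world"))
```

Note that the whitespace on both sides of the removed word is kept, exactly as in the original function.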

Unless you are forced to use regexps, could you use the string replace method instead?
text = text.replace(word, '')
This sidesteps the escaping problem entirely, although it removes every occurrence of word, including occurrences inside longer words.
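A quick illustration of how str.replace behaves differently from the regex in the question (the sample sentence is made up for the example):

```python
text = "This is a thesis"

# str.replace matches raw substrings, so "is" is also removed from
# inside "This" and "thesis", not just the standalone word.
removed_all = text.replace("is", "")

# The optional third argument limits the number of replacements,
# but the first match here is still the one inside "This".
removed_first = text.replace("is", "", 1)

print(repr(removed_all))
print(repr(removed_first))
```

So replace is only a drop-in substitute when matching inside other words is acceptable.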

Write a sanitizer function and pass word through that first.
import re
from functools import reduce  # reduce is not a builtin in Python 3

def sanitize(word):
    def literalize(wd, escapee):
        return wd.replace(escapee, "\\%s" % escapee)
    # Escape the backslash first so backslashes added later are not doubled.
    return reduce(literalize, "\\()[]*?{}.+|^$", word)

def remove_word_from(word, text):
    if not text or not word:
        return text
    rec = re.compile(r'(^|\s)(' + sanitize(word) + r')($|\s)', re.IGNORECASE)
    return rec.sub(r'\1\3', text, 1)

Related

How to replace the punctuation marks in words with efficient code? [duplicate]

This question already has answers here:
Best way to strip punctuation from a string
(32 answers)
Closed 6 years ago.
I have been working on a file that contains a lot of punctuation, which we need to ignore so that we can count the actual length of each word.
Example:
Is this stack overflow! ---> Is this stack overflow
To do this I wrote a separate case for every punctuation mark, which made my code slow. So I am looking for an efficient way to do the same thing using a module or function.
Code snippet:
with open(file_name, 'r') as f:
    for line in f:
        for word in line.split():
            # print word
            # Handling punctuation
            word = word.replace('.', '')
            word = word.replace(',', '')
            word = word.replace('!', '')
            word = word.replace('(', '')
            word = word.replace(')', '')
            word = word.replace(':', '')
            word = word.replace(';', '')
            word = word.replace('/', '')
            word = word.replace('[', '')
            word = word.replace(']', '')
            word = word.replace('-', '')
This is the logic I have written so far. Is there any way to shorten it?

This question is a "classic", but many of the answers don't work in Python 3 because the string.maketrans function has been removed (Python 3 provides str.maketrans instead). A Python 3-compatible solution is to use string.punctuation to get the list of punctuation characters and str.translate to remove them:
import string
"hello, world !".translate({ord(k):"" for k in string.punctuation})
results in:
'hello world '
In Python 3, the argument of translate is a dictionary. Each key is the Unicode code point of a character (as returned by ord), and the value is its replacement string. I created it using a dictionary comprehension.
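The same table can also be built with str.maketrans, which exists in Python 3 as a static method on str. A small sketch (the helper name strip_punctuation is just for illustration):

```python
import string

# With three arguments, str.maketrans maps every character in the third
# argument to None, which str.translate treats as "delete this character".
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(PUNCTUATION_TABLE)

print(strip_punctuation("Is this stack overflow!"))
```

Building the table once and reusing it avoids recreating the dictionary for every word in the file.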

You can use a regular expression with a character class to remove them:
>>> import re
>>> text = 'Is this stack overflow!'
>>> re.sub(r'[]!,:)([/-]', '', text)
'Is this stack overflow'
[]!,:)([/-] is a character class that matches any single character from the set ] ! , : ) ( [ / - and each match is replaced with ''.

Replace command condition for replacing strings [duplicate]

This question already has answers here:
Replace all the occurrences of specific words
(4 answers)
Find substring in string but only if whole words?
(8 answers)
Closed 6 years ago.
I want to replace certain words in a string but keep getting the following result:
String: "This is my sentence."
User types in what they want to replace: "is"
User types what they want to replace word with: "was"
New string: "Thwas was my sentence."
How can I make sure it only replaces the word "is" instead of any string of the characters it finds?
Code function:
import string
def replace(word, new_word):
    new_file = string.replace(word, new_word[1])
    return new_file
Any help is much appreciated, thank you!

Use the regular expression word boundary \b:
import re
print(re.sub(r"\bis\b","was","This is my sentence"))
This is better than a simple split because it also works with punctuation:
print(re.sub(r"\bis\b","was","This is, of course, my sentence"))
gives:
This was, of course, my sentence
Note: don't omit the r prefix, or your regex will be wrong: without it, \b is interpreted as a backspace character.
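Since the word in the question comes from user input, here is a hedged sketch of the same idea with a dynamic word. The function name replace_word is made up for the example, and it assumes the word starts and ends with a letter, digit or underscore (otherwise \b will not match where you expect):

```python
import re

def replace_word(text, word, new_word):
    """Replace whole-word occurrences of `word` in `text` with `new_word`."""
    # re.escape keeps any regex metacharacters in the user's input literal.
    pattern = r'\b' + re.escape(word) + r'\b'
    # A function as the replacement means backslashes in new_word
    # are inserted as-is rather than processed as escapes.
    return re.sub(pattern, lambda match: new_word, text)

print(replace_word("This is my sentence.", "is", "was"))
```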

A simpler but less robust solution that avoids regular expressions (as Jean-François Fabre points out, a plain split does not handle punctuation):
' '.join(x if x != word else new_word for x in string.split())

How Do I Remove A Vowel From String [duplicate]

This question already has answers here:
Remove specific characters from a string
(2 answers)
Closed 8 years ago.
I'm trying to remove all the vowels from a string, using a function that is passed an argument called "text". I'm not sure this is the most efficient way to code it, but it's all I could come up with. I'm not sure how to tell the function to check whether "text" contains any of the characters from the "vowels" list and, if so, to remove them. I thought replacing them with a space in the .replace() function would do the trick, but apparently not. The code is supposed to remove lowercase AND uppercase vowels, so I'm not sure whether converting everything to lowercase is even acceptable. Thanks in advance.
def anti_vowel(text): #Function Definition
    vowels = ['a','e','i','o','u'] #Letters to filter out
    text = text.lower() #Convert string to lower case
    for char in range(0,len(text)):
        if char == vowels[0,4]:
            text = text.replace(char,"")
        else:
            return text

Pretty simple using str.translate() (https://docs.python.org/2.7/library/stdtypes.html#str.translate). This two-argument form is Python 2 only:
return text.translate(None, 'aeiouAEIOU')
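For Python 3, where str.translate takes a single table argument, a rough equivalent could look like this (same function name as in the question):

```python
def anti_vowel(text):
    """Return `text` with lowercase and uppercase vowels removed."""
    # The third argument of str.maketrans lists characters to delete.
    return text.translate(str.maketrans('', '', 'aeiouAEIOU'))

print(anti_vowel("Hey look Words!"))
```

Unlike the lowercase-then-replace approach, this leaves the case of the remaining letters untouched.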

Python's str.replace takes the substring to replace, not a position in the string. What you want to do instead is
for char in range(0,5):
    text = text.replace(vowels[char],"")
return text
UPDATED BASED ON COMMENT:
or you could iterate over the list directly:
for char in vowels:
    text = text.replace(char,"")
return text

Use the sub() function (https://docs.python.org/2/library/re.html#re.sub), with re.IGNORECASE so that uppercase vowels are removed too:
re.sub('[aeiou]', '', text, flags=re.IGNORECASE)

Change your function to loop over the vowels list like this:
def anti_vowel(text): #Function Definition
    vowels = ['a','e','i','o','u'] #Letters to filter out
    text = text.lower() #Convert string to lower case
    for vowel in vowels:
        text = text.replace(vowel,"")
    return text
This simply iterates over the vowels and replaces all occurrences of each vowel.

Remove a prefix from a string [duplicate]

This question already has answers here:
How to remove the left part of a string?
(21 answers)
Closed 6 years ago.
I am trying to do the following, in a clear pythonic way:
def remove_prefix(str, prefix):
    return str.lstrip(prefix)
print(remove_prefix('template.extensions', 'template.'))
This gives:
xtensions
This is not what I was expecting (extensions). Obviously (stupid me), I have used lstrip wrongly: lstrip removes all leading characters that appear in the passed chars string, treating it not as a substring but as "a set of characters to remove from the beginning of the string".
Is there a standard way to remove a substring from the beginning of a string?

As noted by @Boris-Verkhovskiy and @Stefan, on Python 3.9+ you can use
text.removeprefix(prefix)
On older versions, the following gives the same behavior:
def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    return text  # or whatever
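If the code has to run on both old and new Python versions, one possible way to combine the two is sketched below, alongside the lstrip behavior from the question for comparison (the hasattr check is my own addition, not part of the answer):

```python
def remove_prefix(text, prefix):
    """Remove `prefix` from the start of `text` if it is present."""
    if hasattr(str, "removeprefix"):  # available on Python 3.9+
        return text.removeprefix(prefix)
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

# lstrip treats its argument as a set of characters, so it also eats
# the leading "e" of "extensions".
print('template.extensions'.lstrip('template.'))
# remove_prefix removes the exact leading substring only.
print(remove_prefix('template.extensions', 'template.'))
```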

Short and sweet:
def remove_prefix(text, prefix):
    return text[text.startswith(prefix) and len(prefix):]

What about this (a bit late):
def remove_prefix(s, prefix):
    return s[len(prefix):] if s.startswith(prefix) else s

Regex solution (the best way is the solution by @Elazar; this is just for fun):
import re
def remove_prefix(text, prefix):
    return re.sub(r'^{0}'.format(re.escape(prefix)), '', text)
>>> print(remove_prefix('template.extensions', 'template.'))
extensions

I think you can use methods of the str type to do this. There's no need for regular expressions:
def remove_prefix(text, prefix):
    if text.startswith(prefix): # only modify the text if it starts with the prefix
        text = text.replace(prefix, "", 1) # remove one instance of prefix
    return text

def remove_prefix(s, prefix):
    if s.startswith(prefix):
        return s[len(prefix):]
    else:
        return s

I want to remove '\' from a string in python [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to refer to “\” sign in python string
I have a fairly large amount of string data from which I need to remove all characters other than A-Z, a-z and 0-9.
I'm able to remove almost every character, but '\' is a problem: every other character is removed, but '\' is not.
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text
reps = {' ':'-','.':'-','"':'-',',':'-','/':'-',
        '<':'-',';':'-',':':'-','*':'-','+':'-',
        '=':'-','_':'-','?':'-','%':'-','!':'-',
        '$':'-','(':'-',')':'-','\#':'-','[':'-',
        ']':'-','\&':'-','#':'-','\W':'-','\t':'-'}
x.name = x.name.lower()
x1 = replace_all(x.name,reps)

I have a fairly large amount of string data from which I need to remove all characters other than A-Z, a-z and 0-9.
In other words, you want to keep only those characters.
The string class already provides a test for "is every character a letter or number?", called .isalnum(). So we can just filter with that (in Python 3, filter returns an iterator, so join the result back into a string):
>>> ''.join(filter(str.isalnum, 'foo-bar\\baz42'))
'foobarbaz42'

If you have a string:
a = 'hi how \\are you'
you can remove the backslash by doing:
>>> a.replace('\\', '')
'hi how are you'
If you are having trouble in a specific context, I recommend posting a bit more detail.

birryee is correct, you need to escape the backslash with a second backslash.
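To make the escaping rule concrete, here is a small check (variable names are only for illustration). Note that a raw string does not help for a lone backslash, because r'\' is not a valid literal:

```python
a = 'hi how \\are you'   # in source code, '\\' denotes ONE backslash character
backslash = '\\'

# The escaped literal really is a single character.
print(len(backslash))

# Replacing that single character removes the backslash from the text.
cleaned = a.replace(backslash, '')
print(cleaned)
```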

to remove all characters other than A-Z, a-z and 0-9
Instead of trying to list all the characters you want to remove (that would take a long time), use a regular expression to specify those characters you wish to keep:
import re
text = re.sub('[^0-9A-Za-z]', '-', text)
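Wrapped in a function, that might look like the sketch below. The name keep_alnum and the replacement parameter are my own; the default '-' mirrors the dictionary in the question:

```python
import re

def keep_alnum(text, replacement='-'):
    """Replace every character other than A-Z, a-z and 0-9 with `replacement`."""
    # [^...] is a negated character class: it matches any single character
    # that is NOT listed, including the backslash.
    return re.sub(r'[^0-9A-Za-z]', replacement, text)

print(keep_alnum('foo-bar\\baz42', ''))
print(keep_alnum('a b\\c'))
```

Because the class lists what to keep, the backslash needs no special handling at all.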
