Can string replace be written in list comprehension? - python

I have a text and a list.
text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
I want to remove every element of the list from the string. I can do this with a loop:
for element in remove_list:
    text = text.replace(element, '')
I can also use a regex.
But can this be done in a list comprehension or any other one-liner?

You can use functools.reduce:
from functools import reduce
text = reduce(lambda x, y: x.replace(y, ''), remove_list, text)
# 'Some texts  that I want to  replace'
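A rough sketch of what the reduce call is doing: it folds str.replace over remove_list, starting from text, which is the same sequence of steps as the explicit loop. A plain list comprehension is not a natural fit because each replacement has to work on the result of the previous one. The helper name remove_all is hypothetical and only used for illustration.

```python
from functools import reduce

def remove_all(text, substrings):
    """Remove every occurrence of each substring, one after another."""
    return reduce(lambda acc, sub: acc.replace(sub, ''), substrings, text)

text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]

# The same steps written as the explicit loop from the question.
looped = text
for element in remove_list:
    looped = looped.replace(element, '')

result = remove_all(text, remove_list)
print(result == looped)  # True
```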

I would do this with re.sub to remove all the substrings in one pass:
>>> import re
>>> regex = '|'.join(map(re.escape, remove_list))
>>> re.sub(regex, '', text)
'Some texts  that I want to  replace'
Note that the result has two spaces instead of one where each part was removed. If you want each occurrence to leave just one space, you can use a slightly more complicated regex:
>>> re.sub(r'\s*(' + regex + r')', '', text)
'Some texts that I want to replace'
There are other ways to write similar regexes; this one will remove the space preceding a match, but you could alternatively remove the space following a match instead. Which behaviour you want will depend on your use-case.
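As a sketch of the alternative mentioned above, the whitespace can be consumed after each match rather than before it. This assumes the same text and remove_list as in the question; the non-capturing group is just one way to write it.

```python
import re

text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
regex = '|'.join(map(re.escape, remove_list))

# Remove each match together with any whitespace that follows it.
result = re.sub('(?:' + regex + r')\s*', '', text)
print(result)  # Some texts that I want to replace
```

With this variant, a match at the very end of the string leaves the space before it in place, whereas the preceding-space variant leaves a space behind when the match is at the very start.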

You can do this with a regex by building a regex from an alternation of the words to remove, taking care to escape the strings so that the [ and ] in them don't get treated as special characters:
import re
text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
regex = re.compile('|'.join(re.escape(r) for r in remove_list))
text = regex.sub('', text)
print(text)
Output:
Some texts  that I want to  replace
Since this may result in double spaces in the result string, you can remove them with replace, e.g.
text = regex.sub('', text).replace('  ', ' ')
Output:
Some texts that I want to replace
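One caveat worth sketching: a single replace of two spaces with one only halves a run, so three consecutive spaces still leave two behind. The sample string below is made up for illustration; either a regex on runs of spaces or a split/join handles runs of any length.

```python
import re

s = "Some texts" + " " * 2 + "that I want to" + " " * 3 + "replace"

once = s.replace(" " * 2, " ")        # a run of three spaces becomes two
collapsed = re.sub(r" {2,}", " ", s)  # any run of spaces becomes one
split_join = " ".join(s.split())      # also trims ends and handles tabs/newlines

print(collapsed)  # Some texts that I want to replace
```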

Related

Extracting only characters from list items REGEX

I am practising regex and I would like to extract only characters from this list
text=['aQx12', 'aub 6 5']
I want to ignore the numbers and the white spaces and only keep the letters. The desired output is as follows
text=['aQx', 'aub']
I tried the below code but it is not working properly
import re
text=['aQx12', 'aub 6 5']
r = re.compile("\D")
newlist = list(filter(r.match, text))
print(newlist)
Can someone tell me what I need to fix
You're testing the entire string, not individual characters. You need to filter the characters in the strings.
Also, \D matches anything that isn't a digit, so it will include whitespace in the result. You want to match only letters, which is [a-z].
r = re.compile(r'[a-z]', re.I)
newlist = ["".join(filter(r.match, s)) for s in text]
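To make the difference concrete, here is a small sketch comparing what the original filter call does with the per-character version. The variable names whole_strings and per_char are only for illustration.

```python
import re

text = ['aQx12', 'aub 6 5']

# filter() calls the predicate once per list item, i.e. on the whole string.
# r"\D" matches at the start of both strings, so nothing is filtered out.
whole_strings = list(filter(re.compile(r"\D").match, text))

# Iterating over a string yields single characters, so the predicate can be
# applied per character and the surviving characters joined back together.
letter = re.compile(r'[a-z]', re.I)
per_char = ["".join(filter(letter.match, s)) for s in text]

print(whole_strings)  # ['aQx12', 'aub 6 5']
print(per_char)       # ['aQx', 'aub']
```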
You can use re.findall and join the matches instead of combining re.match with filter; use [a-zA-Z] to keep only letters.
>>> [''.join(re.findall('[a-zA-Z]', t)) for t in text]
['aQx', 'aub']
You can do this without a regex as well:
>>> from string import ascii_letters
>>> text = ['aQx12', 'aub 6 5']
>>> [''.join([c for c in sl if c in ascii_letters]) for sl in text]
['aQx', 'aub']
You can remove any characters other than letters in a list comprehension.
A solution without regex:
print([''.join(filter(str.isalpha, s)) for s in ['aQx12', 'aub 6 5']])
Here is a regex-based solution:
import re
text=['aQx12', 'aub 6 5']
newlist = [re.sub(r'[^a-zA-Z]+', '', x) for x in text]
print(newlist)
# => ['aQx', 'aub']
If you need to handle any Unicode letters, use
re.sub(r'[\W\d_]+', '', x)
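A quick sketch of the Unicode variant on a few made-up samples. In Python 3, \W and \d are Unicode-aware for str patterns, so removing [\W\d_] leaves letters from any script; for these samples the result agrees with str.isalpha, though the two are not guaranteed to agree on every character.

```python
import re

samples = ['naïve 42', 'Straße_7', 'aQx12']

# [\W\d_] covers non-word characters, digits and the underscore; what is
# left after removing them are the letters.
by_regex = [re.sub(r'[\W\d_]+', '', s) for s in samples]
by_isalpha = [''.join(filter(str.isalpha, s)) for s in samples]

print(by_regex)  # ['naïve', 'Straße', 'aQx']
```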

Using python and regex to remove repeated character strings within a word

I'm using pandas in python to clean and prepare some data by sorting words in a string alphabetically and removing repeated character strings within a word
For example, "informative text about texting" would become "about informative ing text".
My attempt (below) sorts the words alphabetically and removes duplicate words, but does not remove duplicate words with additional characters either side.
import pandas as pd
from collections import OrderedDict
df = pd.DataFrame({'raw': ['informative text about texting', 'some more text text']})
df['cleaned'] = df['raw'].str.split().apply(lambda x: sorted(OrderedDict.fromkeys(x).keys())).str.join(' ')
df.to_dict()
>>> {'raw': {0: 'informative text about texting', 1: 'some more text text'},
'cleaned': {0: 'about informative text texting', 1: 'more some text'}}
Is there a way to do this using regex?
Thanks!
Sure, there is a way to do this using regex, but it may not entirely be necessary. One may opt for something like this:
string = "informative text about texting"
new_string = string.replace("text", "").replace("  ", " ")
Above, we replace "text" with nothing and then replace a double space with a single space. We need to replace double spaces because when a string contains "text" with a space on either side, it will remove "text" and leave two spaces.
Using regex:
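import re
string = "informative text about texting"
new_string = re.sub(r"\stext\b|text", "", string)
This regex first looks for the whole word "text" together with the space that precedes it (\stext\b), and then uses | as an or operator, followed by text, to also match "text" inside a longer word. The \b matters: without it, \stext would also swallow the space in front of "texting" and give "informative abouting" instead of "informative about ing".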
Edit
Let's take two examples:
"foo bar baz bar"
"foo bar baz barr"
If given the first string, the output should be "foo bar baz" and if given the second string, the output should be "foo bar baz r"
So, how can we accomplish this? Firstly, we need to consider how we can remove duplicates in a string. In this example, I use set to do this. To remove basic duplicates like "bar bar" (not complex duplicates like "bar barr"):
unique = set(string.split())
Then we join unique back into a single string so that we can apply a regex to it (note that a set does not preserve the original word order):
new = " ".join(unique)
Then, we can loop through each word in unique and regex the entire string with each word so that we can remove the complex duplicates I mentioned above:
for word in unique:
    pattern = fr"({word}(?=[^\s]))|((?<=[^\s]){word})"
    new = re.sub(pattern, "", new)
Now, the entire script should look like this:
import re

string = "foo bar baz barr"
unique = set(string.split())
new = " ".join(unique)
for word in unique:
    pattern = fr"({word}(?=[^\s]))|((?<=[^\s]){word})"
    new = re.sub(pattern, "", new)
Regex Explanation
({word}(?=[^\s]))|((?<=[^\s]){word})
This regex uses both a lookahead and a lookbehind. Ask yourself: what criteria must be met for a string of characters to be replaced? Words are separated by spaces, so with the lookahead we can look for occurrences of the word that are immediately followed by a non-space character:
({word}(?=[^\s]))
The [^\s] matches any character that is not whitespace. We can then use the lookbehind in the same manner, so that the regex also matches occurrences that are immediately preceded by a non-space character:
((?<=[^\s]){word})
We then join them with the or operator (|) to complete the pattern:
({word}(?=[^\s]))|((?<=[^\s]){word})
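Putting the pieces together, here is a minimal sketch of the same idea with two changes that are my own assumptions rather than part of the answer above: dict.fromkeys is used instead of set so the word order is kept, and re.escape is applied so that words containing regex metacharacters are treated literally. The function name strip_repeats and its sort flag are hypothetical.

```python
import re

def strip_repeats(text, sort=False):
    """Drop duplicate words, then strip each word from inside longer words."""
    words = list(dict.fromkeys(text.split()))  # unique words, input order kept
    result = " ".join(words)
    for word in words:
        w = re.escape(word)  # treat the word literally inside the pattern
        # The word followed by a non-space, or preceded by a non-space.
        result = re.sub(rf"{w}(?=\S)|(?<=\S){w}", "", result)
    pieces = result.split()  # also discards any leftover empty gaps
    return " ".join(sorted(pieces) if sort else pieces)

print(strip_repeats("foo bar baz bar"))   # foo bar baz
print(strip_repeats("foo bar baz barr"))  # foo bar baz r
print(strip_repeats("informative text about texting", sort=True))
# about informative ing text
```

One thing to keep in mind with this approach: a very short word such as "a" would be stripped out of every longer word that contains it, which may or may not be what you want.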

Why can I not use re.sub to replace a group?

My goal is to find a group in a string using regex and replace it with a space.
The group I am looking to find is a group of symbols only when they fall between strings. When I use re.findall() it works exactly as expected
word = 'This##Is # A # Test#'
print(word)
re.findall(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",word)
>>> ['##', '# ', '# ', '']
But when I use re.sub(), instead of replacing the group, it replaces the entire regex.
x = re.sub(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",r' ',word)
print(x)
>>> '    #'
How can I use regular expressions to replace ONLY the group? The outcome I expect is:
'This Is A Test#'
First, there's no need to escape every "magic" character within a character class, [$#%!\s]* is equally fine and much more readable.
Second, matching (i.e. retrieving) is different from replacing and you could use backreferences to achieve your goal.
Third, if you only want to have # at the end, you could help yourself with a much easier expression:
(?:[\s#](?!\Z))+
Each match would then be replaced by a single space.
In Python this could be:
import re
string = "This##Is # A # Test#"
rx = re.compile(r'(?:[\s#](?!\Z))+')
new_string = rx.sub(' ', string)
print(new_string)
# This Is A Test#
You can capture the portion of the pattern you want to retain and use a backreference in your replacement string instead. A lookahead keeps the following letter out of the match, so that neighbouring matches do not collide:
x = re.sub(r"([a-zA-Z])[$#%!\s]+(?=[a-zA-Z])", r'\1 ', word)
The problem is that your regex matches far more than the symbols you want to replace. Anchor the run of symbols between word boundaries instead:
x = re.sub(r'\b[$#%!\s]+\b', ' ', word)
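To illustrate the underlying point, re.sub always replaces the entire match, and a capturing group only affects what findall reports. One way to limit the replacement, sketched here with lookarounds (a different technique from the word-boundary pattern above), is to make sure the match itself contains nothing but the symbols.

```python
import re

word = 'This##Is # A # Test#'

# The lookbehind and lookahead assert that a letter sits on each side
# without consuming it, so only the run of symbols is part of the match.
result = re.sub(r'(?<=[a-zA-Z])[$#%!\s]+(?=[a-zA-Z])', ' ', word)
print(result)  # This Is A Test#
```

The trailing # is left alone because no letter follows it.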

Substring regex from characters to end of word

I am looking for a regex that will capture a substring beginning with a certain sequence of characters (http in my case) and running up to the next whitespace.
I am doing this in Python, working over a list of strings and replacing the 'bad' substring with ''.
The difficulty is that the sequence does not necessarily start a word. In the example below, the parts I want to capture are the runs that begin with http:
"Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous "
Thank you
Use findall:
>>> text = '''Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous '''
>>> import re
>>> re.findall(r'http\S+', text)
['httpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php', 'httpswwwgooglecomsilvous']
For substitution (if memory is not an issue):
>>> rep = re.compile(r'http\S+')
>>> rep.sub('', text)
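Since the question works over a list of strings, here is a small sketch of applying the compiled pattern in a list comprehension and tidying up the whitespace that the removals leave behind. The sample strings are shortened, made-up stand-ins for the real data.

```python
import re

rep = re.compile(r'http\S+')

# Hypothetical sample data standing in for the real list of strings.
strings = [
    'Pasforcémenthttpwwwexamplecompagephp merci httpswwwgooglecomsilvous ',
    'no link here',
]

# Remove each run starting with http, then normalise the leftover whitespace.
cleaned = [' '.join(rep.sub('', s).split()) for s in strings]
print(cleaned)  # ['Pasforcément merci', 'no link here']
```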
You can try this:
import re
strings = []  # your list of strings goes here
new_strings = [re.sub(r"http.*?php|http.*?$", '', i) for i in strings]

Remove a character in string if it doesn't belong to a group of matching pattern in Python

I have a string containing many words. I want to remove the closing parenthesis from a word if the word doesn't start with _.
Example input:
this is an example to _remove) brackets under certain) conditions.
Output:
this is an example to _remove) brackets under certain conditions.
How can I do that with re.sub, without splitting the string into words?
re.sub accepts a callable as the second parameter, which comes in handy here:
>>> import re
>>> s = 'this is an example to _remove) brackets under certain) conditions.'
>>> re.sub(r'(\w+)\)', lambda m: m.group(0) if m.group(0).startswith('_') else m.group(1), s)
'this is an example to _remove) brackets under certain conditions.'
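For comparison, the same result can be sketched without a callable by letting the pattern itself skip words that start with an underscore. This is an alternative I am suggesting, not part of the answer above, and it assumes words are made of \w characters.

```python
import re

s = 'this is an example to _remove) brackets under certain) conditions.'

# \b anchors at the start of a word and (?!_) rejects words beginning with
# an underscore, so only the remaining words lose their closing parenthesis.
result = re.sub(r'\b(?!_)(\w+)\)', r'\1', s)
print(result)
# this is an example to _remove) brackets under certain conditions.
```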
I wouldn't use regex here when a list comprehension can do it (words is the input string):
result = ' '.join([word.rstrip(")") if not word.startswith("_") else word
                   for word in words.split(" ")])
If you have possible input like:
someword))
that you want to turn into:
someword)
Then you'll have to do:
result = ' '.join([word[:-1] if word.endswith(")") and not word.startswith("_") else word
                   for word in words.split(" ")])
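The difference between the two versions comes down to rstrip removing every trailing parenthesis while the slice removes at most one. A short sketch, with the helper names strip_all and strip_one made up for illustration:

```python
def strip_all(word):
    """Remove every trailing ')' unless the word starts with '_'."""
    return word if word.startswith("_") else word.rstrip(")")

def strip_one(word):
    """Remove at most one trailing ')' unless the word starts with '_'."""
    if word.endswith(")") and not word.startswith("_"):
        return word[:-1]
    return word

samples = ["someword))", "certain)", "_remove)", "plain"]
print([strip_all(w) for w in samples])  # ['someword', 'certain', '_remove)', 'plain']
print([strip_one(w) for w in samples])  # ['someword)', 'certain', '_remove)', 'plain']
```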
