Regex to remove specific words in python - python

I want to do the some manipulation using regex in python.
So input is +1223,+12_remove_me,+222,+2223_remove_me
and
output should be +1223,+222
Output should only contain comma seperated words which don't contain _remove_me and only one comma between each word.
Note: REGEX which I tried \+([0-9|+]*)_ , \+([0-9|+]*) and some other combination using which I did not get required output.
Note 2 I can't use loop, need to do that without loop with regex only.

Your regex seems incomplete, but you were on the right track. Note that a pipe symbol inside a character class is treated as a literal and your [0-9|+] matches a digit or a | or a + symbols.
You may use
,?\+\d+_[^,]+
See the regex demo
Explanation:
,? - optional , (if the "word" is at the beginning of the string, it should be optional)
\+ - a literal +
\d+ - 1+ digits
_ - a literal underscore
[^,]+ - 1+ characters other than ,
Python demo:
import re
p = re.compile(r',?\+\d+_[^,]+')
test_str = "+1223,+12_remove_me,+222,+2223_remove_me"
result = p.sub("", test_str)
print(result)
# => +1223,+222

A non-regex approach would involve using str.split() and excluding items ending with _remove_me:
>>> s = "+1223,+12_remove_me,+222,+2223_remove_me"
>>> items = [item for item in s.split(",") if not item.endswith("_remove_me")]
>>> items
['+1223', '+222']
Or, if _remove_me can be present anywhere inside each item, use not in:
>>> items = [item for item in s.split(",") if "_remove_me" not in item]
>>> items
['+1223', '+222']
You can then use str.join() to join the items into a string again:
>>> ",".join(items)
'+1223,+222'

In your case you need regex with negotiation
[^(_remove_me)]
Demo

You could perform this without a regex, just using string manipulation. The following can be written as a one-liner, but has been expanded for readability.
my_string = '+1223,+12_remove_me,+222,+2223_remove_me' #define string
my_list = my_string.split(',') #create a list of words
my_list = [word for word in my_list if '_remove_me' not in word] #stop here if you want a list of words
output_string = ','.join(my_list)

Related

python findall regex expression

I got a long string and i need to find words which contain the character 'd' and afterwards the character 'e'.
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
b=' '.join(l)
runs1=re.findall(r"\b\w?d.*e\w?\b",b)
print(runs1)
\b is the boundary of the word, which follows with any char (\w?) and etc.
I get an empty list.
You can massively simplify your solution by applying a regex based search on each string individually.
>>> p = re.compile('d.*e')
>>> list(filter(p.search, l))
Or,
>>> [x for x in l if p.search(x)]
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Why didn't re.findall work? You were searching one large string, and your greedy match in the middle was searching across strings. The fix would've been
>>> re.findall(r"\b\S*d\S*e\S*", ' '.join(l))
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Using \S to match anything that is not a space.
You can filter the result :
import re
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
pattern = r'd.*?e'
print(list(filter(lambda x:re.search(pattern,x),l)))
output:
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Something like this maybe
\b\w*d\w*e\w*
Note that you can probably remove the word boundary here because
the first \w guarantees a word boundary before.
The same \w*d\w*e\w*

Removing specific duplicated characters from a string in Python

How i can delete specific duplicated characters from a string only if they goes one after one in Python? For example:
A have string
string = "Hello _my name is __Alex"
I need to delete duplicate _ only if they goes one after one __ and get string like this:
string = "Hello _my name is _Alex"
If i use set i got this:
string = "_yoiHAemnasxl"
(Big edit: oops, I missed that you only want to de-deuplicate certain characters and not others. Retrofitting solutions...)
I assume you have a string that represents all the characters you want to de-duplicate. Let's call it to_remove, and say that it's equal to "_.-". So only underscores, periods, and hyphens will be de-duplicated.
You could use a regex to match multiple successive repeats of a character, and replace them with a single character.
>>> import re
>>> to_remove = "_.-"
>>> s = "Hello... _my name -- is __Alex"
>>> pattern = "(?P<char>[" + re.escape(to_remove) + "])(?P=char)+"
>>> re.sub(pattern, r"\1", s)
'Hello. _my name - is _Alex'
Quick breakdown:
?P<char> assigns the symbolic name char to the first group.
we put to_remove inside the character matching set, []. It's necessary to call re.escape because hyphens and other characters may have special meaning inside the set otherwise.
(?P=char) refers back to the character matched by the named group "char".
The + matches one or more repetitions of that character.
So in aggregate, this means "match any character from to_remove that appears more than once in a row". The second argument to sub, r"\1", then replaces that match with the first group, which is only one character long.
Alternative approach: write a generator expression that takes only characters that don't match the character preceding them.
>>> "".join(s[i] for i in range(len(s)) if i == 0 or not (s[i-1] == s[i] and s[i] in to_remove))
'Hello. _my name - is _Alex'
Alternative approach #2: use groupby to identify consecutive identical character groups, then join the values together, using to_remove membership testing to decide how many values should be added..
>>> import itertools
>>> "".join(k if k in to_remove else "".join(v) for k,v in itertools.groupby(s, lambda c: c))
'Hello. _my name - is _Alex'
Alternative approach #3: call re.sub once for each member of to_remove. A bit expensive if to_remove contains a lot of characters.
>>> for c in to_remove:
... s = re.sub(rf"({re.escape(c)})\1+", r"\1", s)
...
>>> s
'Hello. _my name - is _Alex'
Simple re.sub() approach:
import re
s = "Hello _my name is __Alex aa"
result = re.sub(r'(\S)\1+', '\\1', s)
print(result)
\S - any non-whitespace character
\1+ - backreference to the 1st parenthesized captured group (one or more occurrences)
The output:
Helo _my name is _Alex a

Remove a character in string if it doesn't belong to a group of matching pattern in Python

If I have a string such that it contains many words. I want to remove the closing parenthesis if the word in the string doesn't start with _.
Examples input:
this is an example to _remove) brackets under certain) conditions.
Output:
this is an example to _remove) brackets under certain conditions.
How can I do that without splitting the words using re.sub?
re.sub accepts a callable as the second parameter, which comes in handy here:
>>> import re
>>> s = 'this is an example to _remove) brackets under certain) conditions.'
>>> re.sub('(\w+)\)', lambda m: m.group(0) if m.group(0).startswith('_') else m.group(1), s)
'this is an example to _remove) brackets under certain conditions.'
I wouldn't use regex here when a list comprehension can do it.
result = ' '.join([word.rstrip(")") if not word.startswith("_") else word
for word in words.split(" ")])
If you have possible input like:
someword))
that you want to turn into:
someword)
Then you'll have to do:
result = ' '.join([word[:-1] if word.endswith(")") and not word.startswith("_") else word
for word in words.split(" ")])

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words, three or more word will be replaced by two words.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
# pattern to look for three or more repetitions of any character, including newlines.
pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
return pattern.sub(r"\1\1", string)
Assuming that what is called "word" in your requirements is one or more non-whitespaces characters surrounded by whitespaces or string limits, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
You could try the below regex also,
(?<= |^)(\S+)(?: \1){2,}(?= |$)
Sample code,
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
DEMO
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
last = ''
out = []
for word in s.split():
same = 0 if word != last else same + 1
if same < max: out.append(word)
last = word
return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
Try the following:
import re
s = your string
s = re.sub( r'(\S+) (?:\1 ?){2,}', r'\1 \1', s )
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
# Pattern to look for three or more repetitions of any words.
pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
return pattern.sub(r"\1", string)

How to find a non-alphanumeric character and move it to the end of a string in Python

I have the following string:
"string.isnotimportant"
I want to find the dot (it could be any non-alphanumeric character), and move it to the end of the string.
The result should look like:
"stringisnotimportant."
I am looking for a regular expression to do this job.
import re
inp = "string.isnotimportant"
re.sub('(\w*)(\W+)(\w*)', '\\1\\3\\2', inp)
>>> import re
>>> string = "string.isnotimportant"
#I explain a bit about this at the end
>>> regex = '\w*(\W+)\w*' # the brackets in the regex mean that item, if matched will be stored as a group
#in order to understand the re module properly, I think your best bet is to read some docs, I will link you at the end of the post
>>> x = re.search(regex, string)
>>> x.groups() #remember the stored group above? well this accesses that group.
#if there were more than one group above, there would be more items in the tuple
('.',)
#here I reassign the variable string to a modified version where the '.' is replaced with ''(nothing).
>>> string = string.replace('.', '')
>>> string += x.groups()[0] # here I basically append a letter to the end of string
The += operator appends a character to the end of a string. Since strings don't have an .append method like lists do, this is a handy feature. x.groups()[0] refers to the first item(only item in this case) of the tuple above.
>>> print string
"stringisnotimportant."
about the regex:
"\w" Matches any alphanumeric character and the underscore: a through z, A through Z, 0 through 9, and '_'.
"\W" Matches any non-alphanumeric character. Examples for this include '&', '$', '#', etc.
https://developers.google.com/edu/python/regular-expressions?csw=1
http://python.about.com/od/regularexpressions/a/regexprimer.htm

Categories

Resources