Removing specific duplicated characters from a string in Python - python

How i can delete specific duplicated characters from a string only if they goes one after one in Python? For example:
A have string
string = "Hello _my name is __Alex"
I need to delete duplicate _ only if they goes one after one __ and get string like this:
string = "Hello _my name is _Alex"
If i use set i got this:
string = "_yoiHAemnasxl"

(Big edit: oops, I missed that you only want to de-deuplicate certain characters and not others. Retrofitting solutions...)
I assume you have a string that represents all the characters you want to de-duplicate. Let's call it to_remove, and say that it's equal to "_.-". So only underscores, periods, and hyphens will be de-duplicated.
You could use a regex to match multiple successive repeats of a character, and replace them with a single character.
>>> import re
>>> to_remove = "_.-"
>>> s = "Hello... _my name -- is __Alex"
>>> pattern = "(?P<char>[" + re.escape(to_remove) + "])(?P=char)+"
>>> re.sub(pattern, r"\1", s)
'Hello. _my name - is _Alex'
Quick breakdown:
?P<char> assigns the symbolic name char to the first group.
we put to_remove inside the character matching set, []. It's necessary to call re.escape because hyphens and other characters may have special meaning inside the set otherwise.
(?P=char) refers back to the character matched by the named group "char".
The + matches one or more repetitions of that character.
So in aggregate, this means "match any character from to_remove that appears more than once in a row". The second argument to sub, r"\1", then replaces that match with the first group, which is only one character long.
Alternative approach: write a generator expression that takes only characters that don't match the character preceding them.
>>> "".join(s[i] for i in range(len(s)) if i == 0 or not (s[i-1] == s[i] and s[i] in to_remove))
'Hello. _my name - is _Alex'
Alternative approach #2: use groupby to identify consecutive identical character groups, then join the values together, using to_remove membership testing to decide how many values should be added..
>>> import itertools
>>> "".join(k if k in to_remove else "".join(v) for k,v in itertools.groupby(s, lambda c: c))
'Hello. _my name - is _Alex'
Alternative approach #3: call re.sub once for each member of to_remove. A bit expensive if to_remove contains a lot of characters.
>>> for c in to_remove:
... s = re.sub(rf"({re.escape(c)})\1+", r"\1", s)
...
>>> s
'Hello. _my name - is _Alex'

Simple re.sub() approach:
import re
s = "Hello _my name is __Alex aa"
result = re.sub(r'(\S)\1+', '\\1', s)
print(result)
\S - any non-whitespace character
\1+ - backreference to the 1st parenthesized captured group (one or more occurrences)
The output:
Helo _my name is _Alex a

Related

Python:Regex to remove more than N consecutive letters

lets say I have this string : Sayy Hellooooooo
if N = 2
I want the result to be (Using Regex): Sayy Helloo
Thank U in advance
Another option is to use re.sub with a callback:
N = 2
result = re.sub(r'(.)\1+', lambda m: m.group(0)[:N], your_string)
You could build the regex dynamically for a given n, and then call sub without callback:
import re
n = 2
regex = re.compile(rf"((.)\2{{{n-1}}})\2+")
s = "Sayy Hellooooooo"
print(regex.sub(r"\1", s)) # Sayy Helloo
Explanation:
{{: this double brace represents a literal brace in an f-string
{n-1} injects the value of n-1, so together with the additional (double) brace-wrap, this {{{n-1}}} produces {2} when n is 3.
The outer capture group captures the maximum allowed repetition of a character
The additional \2+ captures more subsequent occurrences of that same character, so these are the characters that need removal.
The replacement with \1 thus reproduces the allowed repetition, but omits the additional repetition of that same character.
You could use backreferences to mach the previous character. So (a|b)\1 would match aa or bb. In your case you would want probably any letter and any number of repetitions so ([a-zA-Z])\1{n,} for N repetitions. Then substitute it with one occurence using \1 again. So putting it all together:
import re
n=2
expression = r"([a-zA-Z])\1{"+str(n)+",}"
print(re.sub(expression,r"\1","hellooooo friiiiiend"))
# Outputs Hello friend
Attempt This Online!
Note this actually matches N+1 repetitions only, like your test cases. One item then N copies of it. If you want to match exactly N also subtract 1.
Remember to use r in front of regular expressions so you don't need to double escape backslashes.
Learn more about backreferences: https://www.regular-expressions.info/backref.html Learn more about repetition: https://www.regular-expressions.info/repeat.html
You need a regex that search for multiple occurence of the same char, that is done with (.)\1 (the \1 matches the group 1 (in the parenthesis))
To match
2 occurences : (.)\1
3 occurences : (.)\1\1 or (.)\1{2}
4 occurences : (.)\1\1\1 or (.)\1{3}
So you can build it with an f-string and the value you want (that's a bit ugly because you have literal brackets that needs to be escaped using double brackets, and inside that the bracket to allow the value itself)
def remove_letters(value: str, count: int):
return re.sub(rf"(.)\1{{{count}}}", "", value)
print(remove_letters("Sayy Hellooooooo", 1)) # Sa Heo
print(remove_letters("Sayy Hellooooooo", 2)) # Sayy Hello
print(remove_letters("Sayy Hellooooooo", 3)) # Sayy Hellooo
You may understand the pattern creation easier with that
r"(.)\1{" + str(count) + "}"
This seems to work:
When N=2: the regex pattern is compiled to : ((\w)\2{2,})
When N=3: the regex pattern is compiled to : ((\w)\2{3,})
Code:
import re
N = 2
p = re.compile(r"((\w)\2{" + str(N) + r",})")
text = "Sayy Hellooooooo"
matches = p.findall(text)
for match in matches:
text = re.sub(match[0], match[1]*N, text)
print(text)
Output:
Sayy Helloo
Note:
Also tested with N=3, N=4 and other text inputs.

python regex - replace newline (\n) to something else

I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?
EDIT To replace multiple continuous newline characters (\n) to ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) is to assert "newline characters followed by Capital Letter". REGEX DEMO.
Well let's take a look at your regex ([\n]+)([A-Z])+ - the first part ([\n]+) is fine, matching multiple occurences of a newline into one group (note - this wont match the carriage return \r). However the second part ([A-Z])+ leeds to your error it matches a single uppercase letter into a capturing group - multiple times, if there are multiple Uppercase letter, which will reset the group to the last matched uppercase letter, which is then used for the replace.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
You could simply place the + inside the capturing group, so multiple uppercase letters are matched into it. You could also just leave it out, as it doesn't make a difference, how many of these uppercase letters follow.
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns
Try:
import re
p = re.compile(ur'[\r?\n]')
test_str = u"(2 months)\n\nML"
subst = u"_"
result = re.sub(p, subst, test_str)
It will reduce string to
(2 months)__ML
See Demo

How to change a quantifier in a Regex based on a condition?

I would like to find words of length >= 1 which may contain a ' or a - within. Here is a test string:
a quake-prone area- (aujourd'hui-
In Python, I'm currently using this regex:
string = "a quake-prone area- (aujourd'hui-"
RE_WORDS = re.compile(r'[a-z]+[-\']?[a-z]+')
words = RE_WORDS.findall(string)
I would like to get this result:
>>> words
>>> [u'a', u'quake-prone', u'area', u"aujourd'hui"]
but I get this instead:
>>> words
>>> [u'quake-prone', u'area', u"aujourd'hui"]
Unfortunately, because of the last + quantifier, it skips all words of length 1. If I use the * quantifier, it will find a but also area- instead of area.
Then how could create a conditional regex saying: if the word contains an apostrophe or an hyphen, use the + quantifier else use the * quantifier ?
I suggest you to change the last [-\']?[a-z]+ part as optional by putting it into a group and then adding a ? quantifier next to that group.
>>> string = "a quake-prone area- (aujourd'hui-"
>>> RE_WORDS = re.compile(r'[a-z]+(?:[-\'][a-z]+)?')
>>> RE_WORDS.findall(string)
['a', 'quake-prone', 'area', "aujourd'hui"]
Reason for why the a is not printed is because of your regex contains two [a-z]+ which asserts that there must be atleast two lowercase letters present in the match.
Note that the regex i mentioned won't match area- because (?:[-\'][a-z]+)? optional group asserts that there must be atleast one lowercase letter would present just after to the - symbol. If no, then stop matching until it reaches the hyphen. So that you got area at the output instead of area- because there isn't an lowercase letter exists next to the -. Here it stops matching until it finds an hyphen without following lowercase letter.

How to find a non-alphanumeric character and move it to the end of a string in Python

I have the following string:
"string.isnotimportant"
I want to find the dot (it could be any non-alphanumeric character), and move it to the end of the string.
The result should look like:
"stringisnotimportant."
I am looking for a regular expression to do this job.
import re
inp = "string.isnotimportant"
re.sub('(\w*)(\W+)(\w*)', '\\1\\3\\2', inp)
>>> import re
>>> string = "string.isnotimportant"
#I explain a bit about this at the end
>>> regex = '\w*(\W+)\w*' # the brackets in the regex mean that item, if matched will be stored as a group
#in order to understand the re module properly, I think your best bet is to read some docs, I will link you at the end of the post
>>> x = re.search(regex, string)
>>> x.groups() #remember the stored group above? well this accesses that group.
#if there were more than one group above, there would be more items in the tuple
('.',)
#here I reassign the variable string to a modified version where the '.' is replaced with ''(nothing).
>>> string = string.replace('.', '')
>>> string += x.groups()[0] # here I basically append a letter to the end of string
The += operator appends a character to the end of a string. Since strings don't have an .append method like lists do, this is a handy feature. x.groups()[0] refers to the first item(only item in this case) of the tuple above.
>>> print string
"stringisnotimportant."
about the regex:
"\w" Matches any alphanumeric character and the underscore: a through z, A through Z, 0 through 9, and '_'.
"\W" Matches any non-alphanumeric character. Examples for this include '&', '$', '#', etc.
https://developers.google.com/edu/python/regular-expressions?csw=1
http://python.about.com/od/regularexpressions/a/regexprimer.htm

Remove duplicate chars using regex?

Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -
import re
re.sub("a*", "a", "aaaa") # gives 'a'
What if I want to replace all duplicate chars (i.e. a,z) with that respective char? How do I do this?
import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'
NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes
>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'
The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.
Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then entire found portion is replaced with a single occurrence of the found letter.
On side note...
Your example code for just a is actually buggy:
>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'
You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".
In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this
s="ababacbdefefbcdefde"
while re.search(r'([a-z])(.*)\1', s):
s= re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print s # prints 'abcdef'
A solution including all category:
re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')
gives:
'ab['

Categories

Resources