Python regex with number of occurrences


Hi, I'm looking for a regular expression that would let me not only replace characters but also annotate the number of occurrences.
For example, I would like to replace every run of special characters with "s", every run of letters with "c", and every run of digits with "d", and annotate the length of each run between "{}".
If I have "123-45AB-78!£", I would like to get d{3}s{1}d{2}c{2}s{1}d{2}s{2}.
Is there a way to do that with regex?
Many thanks

Here is one approach using re.sub with a callback function:
import re
def repl(m):
    # m.group() is one run of letters, digits, or other characters
    c = m.group()
    if re.search(r'^[A-Za-z]+$', c):
        return 'c{' + str(len(c)) + '}'
    elif re.search(r'^\d+$', c):
        return 'd{' + str(len(c)) + '}'
    else:
        return 's{' + str(len(c)) + '}'
x = "123-45AB-78!£"
# [^A-Za-z\d]+ is used instead of \D+ so that a run of special characters stops at the next letter
print(re.sub(r'[A-Za-z]+|\d+|[^A-Za-z\d]+', repl, x))
# d{3}s{1}d{2}c{2}s{1}d{2}s{2}
Note that the input string contains a non-ASCII character. In Python 3, a str is a sequence of Unicode characters, so len() counts characters directly, as above. In Python 2, where a plain str holds bytes, len() would count the UTF-8 bytes instead. Assuming a UTF-8 encoded byte string s, the character count there is:
len(s.decode('utf8'))
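The difference between counting characters and counting UTF-8 bytes can be checked with a short Python 3 sketch. This is an illustration only, not part of the original answer, and the variable names are made up for the example.

```python
# Python 3: str holds Unicode characters, bytes holds the raw encoded form.
text = "!£"
encoded = text.encode("utf8")

char_count = len(text)                    # counts characters
byte_count = len(encoded)                 # counts UTF-8 bytes ("£" takes two)
round_trip = len(encoded.decode("utf8"))  # decoding restores the character count

print(char_count, byte_count, round_trip)
```

The byte count is larger than the character count whenever the text contains characters outside ASCII, which is why decoding matters when the data arrives as bytes.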

Here is a method that first replaces each character by its type-character, then counts them with itertools.groupby. I'm not sure it is any faster than the good answer given by Tim, but it should be comparable.
import re
from itertools import groupby
x = "123-45AB-78!£"
for pat, sub in [(r"[A-Za-z]", "c"), (r"\d", "d"), (r"[^\d\w]", "s")]:
    x = re.sub(pat, sub, x)
print(x)  # dddsddccsddss
# Each run of identical type-characters becomes "<type>{<run length>}"
y = "".join([f"{k}{{{len(list(g))}}}" for k, g in groupby(x)])
print(y)  # d{3}s{1}d{2}c{2}s{1}d{2}s{2}
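A possible variation on the same idea, sketched here under the assumption that only ASCII letters and digits count as "c" and "d": classify each character with a small helper and let groupby do the run counting in a single pass, without regex. The names kind and annotate are hypothetical, chosen for this example.

```python
import string
from itertools import groupby


def kind(ch):
    """Map one character to its type code: c (ASCII letter), d (ASCII digit), s (other)."""
    if ch in string.ascii_letters:
        return "c"
    if ch in string.digits:
        return "d"
    return "s"


def annotate(text):
    # groupby(text, key=kind) yields one group per run of same-kind characters
    return "".join(f"{k}{{{sum(1 for _ in run)}}}" for k, run in groupby(text, key=kind))


print(annotate("123-45AB-78!£"))
```

Unlike the regex version, this sketch treats the underscore and non-ASCII letters as "s", so the two approaches can differ on such input.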

Related

How can I remove specific duplicates from a list, rather than remove all duplicates indiscriminately?

In a python script, I need to assess whether a string contains duplicates of a specific character (e.g., "f") and, if so, remove all but the first instance of that character. Other characters in the string may also have duplicates, but the script should not remove any duplicates other than those of the specified character.
This is what I've got so far. The script runs, but it is not accomplishing the desired task. I modified the reduce() line from the top answer to this question, but it's a little more complex than what I've learned at this point, so it's difficult for me to tell what part of this is wrong.
import re
from functools import reduce
string = "100 ffeet"
dups = ["f", "t"]
for char in dups:
    if string.count(char) > 1:
        lst = list(string)
        reduce(lambda acc, el: acc if re.match(char, el) and el in acc else acc + [el], lst, [])
        string = "".join(lst)

Let's create a function that receives a string s and a character c as parameters, and returns a new string where all but the first occurrence of c in s are removed.
We'll be making use of the following functions from Python std lib:
str.find(sub): Return the lowest index in the string where substring sub is found.
str.replace(old, new): Return a copy of the string with all occurrences of substring old replaced by new.
The idea is straightforward:
1) Find the first index of c in s.
2) If none is found, return s.
3) Make a substring of s starting from the character after c.
4) Remove all occurrences of c in the substring.
5) Concatenate the first part of s with the updated substring.
6) Return the final string.
In Python:
def remove_all_but_first(s, c):
    i = s.find(c)
    if i == -1:
        return s
    i += 1
    return s[:i] + s[i:].replace(c, '')
Now you can use this function to remove all the characters you want.
def main():
    s = '100 ffffffffeet'
    dups = ['f', 't', 'x']
    print('Before:', s)
    for c in dups:
        s = remove_all_but_first(s, c)
    print('After:', s)
if __name__ == '__main__':
    main()

Here is one way that you could do it
string = "100 ffeet"
dups = ["f", "t"]
seen = []
result = ""
# Walk forward so the first occurrence of each listed character is the one kept
for ch in string:
    if ch in dups and ch in seen:
        continue  # later occurrence: drop it
    if ch in dups:
        seen.append(ch)
    result += ch
string = result
print(string)  # 100 feet
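For reference, the reduce() call in the question appears to have no effect mainly because reduce() returns the accumulated list rather than modifying lst, and that return value is discarded. A minimal sketch of the same idea with the result kept follows. The function name keep_first is hypothetical, and a plain equality test stands in for re.match so that characters such as "." are not interpreted as regex patterns.

```python
from functools import reduce


def keep_first(string, dups):
    """Drop every occurrence after the first of each character listed in dups."""
    for char in dups:
        if string.count(char) > 1:
            # reduce() returns the accumulated list; it does not modify its input
            kept = reduce(
                lambda acc, el: acc if el == char and el in acc else acc + [el],
                list(string),
                [],
            )
            string = "".join(kept)
    return string


print(keep_first("100 ffeet", ["f", "t"]))
```

Characters that are not listed in dups, such as the repeated "e" here, are left untouched.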

If a specific string, A, is present at the beginning and/or end of a string B, how do we remove A from B?

My question is similar, but different from the following:
How do I remove a substring from the end of a string in Python?
Suppose we have:
input = "baabbbbb_xx_ba_xxx_abbbbbba"
We want to keep everything except the ba at the end and the ba at the beginning.
1) Direct strip() fails
strip treats the string as a set. That is, strip will remove the letters a and b appearing in any order. We want to only remove the characters ba if they appear in that exact order. Also, unlike strip, we want only zero or one copies removed from the end of the string. "x\n\n\n\n".strip() will remove many new-lines, not just one.
input = "baabbbbb_xx_ba_xxx_abbbbbba"
output = input.strip("ba")
print(output)
# prints "_xx_ba_xxx_"
2) Direct replace() fails
input = "xx_ba_xxx"
output = input.replace("ba", "")
print(output)
# prints `xx__xxx`
Not cool; we only want to remove the sequence "ba" from the beginning and end of the string, not the middle.
3) Just nope
input = "baabbbbb_xx_ba_xxx_abbbbbba"
output = "ba".join(input.rsplit("ba", 1))
print(output)
# output==input
Final Note
The solution must be general: a function accepting any two input strings, one of which might not be "ba". The undesired leading and trailing strings might contain ".", "*" and other characters that are awkward to use in regular expressions.

My solution uses basic hashing. Because hash collisions are possible, each hash match is confirmed with a direct string comparison.
Let me know if this helps you with your problem.
import functools
def strip_ed(pattern, string):
    # Nothing to strip if the pattern is empty or longer than the string
    if not pattern or len(pattern) > len(string):
        return string
    n = len(pattern)
    base = 26
    # Hash code for the beginning of the string
    string_hash_beginning = functools.reduce(lambda h, c: h * base + ord(c), string[:n], 0)
    # Hash code for the end of the string
    string_hash_end = functools.reduce(lambda h, c: h * base + ord(c), string[-n:], 0)
    # Hash code for the pattern
    pattern_hash = functools.reduce(lambda h, c: h * base + ord(c), pattern, 0)
    # A hash match is confirmed by comparing the actual characters
    at_start = string_hash_beginning == pattern_hash and string[:n] == pattern
    at_end = string_hash_end == pattern_hash and string[-n:] == pattern
    if at_start and at_end:
        return string[n:-n]
    elif at_start:
        return string[n:]
    elif at_end:
        return string[:-n]
    else:
        return string

This seems to work:
def ordered_strip(whole, part):
    center = whole
    # An empty part would make center[:-0] wipe the whole string, so skip it
    if part and whole.endswith(part):
        center = center[:-len(part)]
    if part and whole.startswith(part):
        center = center[len(part):]
    return center
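On Python 3.9 or later, the built-in str.removeprefix and str.removesuffix methods cover the same need without regular expressions, so characters like "." and "*" need no escaping. The sketch below assumes that Python version. The wrapper name strip_once is hypothetical.

```python
def strip_once(whole, part):
    """Remove at most one leading and one trailing copy of part (Python 3.9+)."""
    return whole.removeprefix(part).removesuffix(part)


print(strip_once("baabbbbb_xx_ba_xxx_abbbbbba", "ba"))
print(strip_once(".*mid.*", ".*"))
```

Both methods return the string unchanged when the prefix or suffix is absent or empty. Note that when the leading and trailing copies overlap, the prefix is removed first and the suffix is then checked against what remains.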

How can we remove word with repeated single character?

I am trying to collapse words that consist of a single repeated character, using regex in Python. For example:
good => good
gggggggg => g
What I have tried so far is the following:
re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
The problem with the above solution is that it changes good to god, whereas I only want to change words made up of a single repeated character.

A better approach here is to use a set
def modify(s):
    # Create a set from the string
    c = set(s)
    # If the set holds only one character, convert the set to a string
    if len(c) == 1:
        return ''.join(c)
    # Otherwise return the original string
    else:
        return s
print(modify('good'))
print(modify('gggggggg'))
If you want to use regex, mark the start and end of the string in the regex with ^ and $ (inspired by bobblebubble's comment):
import re
def modify(s):
    # Substitute only when the whole string is a single repeated character;
    # ^ and $ anchor the match to the start and end of the string
    out = re.sub(r'^([a-z])\1+$', r'\1', s)
    return out
print(modify('good'))
print(modify('gggggggg'))
The output will be
good
g
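If the input is a longer text rather than a single word, one option is to swap the ^ and $ anchors for \b word boundaries so each word is checked on its own. This is a sketch under the same assumption as the pattern above, namely lowercase ASCII letters only. The function name is hypothetical.

```python
import re


def collapse_repeated_words(text):
    # \b...\b limits the match to whole words; \1+ requires the same letter throughout
    return re.sub(r'\b([a-z])\1+\b', r'\1', text)


print(collapse_repeated_words("good gggggggg food zzz"))
```

Words such as "good" and "food" are left alone because the repeated letter does not span the whole word.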

If you do not want to use a set in your method, this should do the trick:
def simplify(s):
    l = len(s)
    if l>1 and s.count(s[0]) == l:
        return s[0]
    return s
print(simplify('good'))
print(simplify('abba'))
print(simplify('ggggg'))
print(simplify('g'))
print(simplify(''))
output:
good
abba
g
g
Explanation:
1) Compute the length of the string.
2) Count the characters equal to the first one and compare that count with the string length.
3) Depending on the result, return the first character or the whole string.

You can use a trim method. Take a look at this example (C#):
"ggggggg".Trim('g');
Update:
For characters in the middle of the string, use this function, thanks to this answer.
In C#:
public static string RemoveDuplicates(string input)
{
    return new string(input.ToCharArray().Distinct().ToArray());
}
In Python:
used = set()
unique = [x for x in mylist if x not in used and (used.add(x) or True)]
However, I think none of these answers handles a string like aaaaabbbbbcda, which has an a at the end that does not appear in the result (abcd). For this kind of situation, use this function, which I wrote:
In:
def unique(s):
    used = set()
    ret = list()
    s = list(s)
    for x in s:
        if x not in used:
            ret.append(x)
            # Reset the set so only the current run's character is remembered
            used = set()
            used.add(x)
    return ret
print(unique('aaaaabbbbbcda'))
out:
['a', 'b', 'c', 'd', 'a']
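The function above effectively keeps one character per run of equal adjacent characters. As a sketch of the same behavior with the standard library, itertools.groupby groups adjacent equal items, so taking each group's key gives the same list. The name collapse_runs is hypothetical.

```python
from itertools import groupby


def collapse_runs(s):
    # groupby yields one (key, group) pair per run of equal adjacent characters
    return [key for key, _ in groupby(s)]


print(collapse_runs('aaaaabbbbbcda'))
```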

Replace substring of given indices range

I am new to Python programming and I am stuck at one point. Let's say I have the string "hello-world". I want to replace all the characters of this string with "*" except the first and last, so the result will be "h****-****d".
One way to do this is as below:
In [1]: s = "hello-world"
In [2]: s[0] + "*"*(len(s)-2) + s[-1]
Out[2]: 'h*********d'
If I want to replace all characters with "*" except the first and last 2 characters:
In [3]: s[:2] + "*"*(len(s)-4) + s[-2:]
Out[3]: 'he*******ld'
Is there a neater way to handle these types of problems? Any help would be appreciated. Thanks.

I think what you want to do is this:
def obscure(string, n):
    characters = list(string)
    characters[n:-n] = '*' * len(characters[n:-n])
    obscured = ''.join(characters)
    return obscured
Turn the string into a list of characters. Replace the ones you want to obscure. Then join the list back into a string.

You can use str.join (and the string module to check against letters):
import string
s[0] + ''.join(['*' if i in string.ascii_letters else i
                for i in s[1:-1]]) + s[-1]
Since you said you wanted h****-****d where the hyphen isn't replaced, you would need to test whether the characters are letters or not. You could change string.ascii_letters to:
chars = 'abcdefghijklmnopqrstuvwxyz'
chars = chars + chars.upper() + '0123456789' # + 'some_other_chars'
...if you want to include other characters like numbers or punctuation. Or you can write out the letters you want to replace manually.
You may also want to perform a check to see whether the string is 3 characters or more so that no errors are raised.
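A minimal sketch of that length check, wrapped around the join expression from this answer. The function name mask_letters is hypothetical, and strings shorter than 3 characters are returned unchanged, which is one reasonable choice among several.

```python
import string


def mask_letters(s):
    """Replace ASCII letters between the first and last character with '*'."""
    if len(s) < 3:
        # Nothing lies strictly between the first and last character
        return s
    middle = ''.join('*' if ch in string.ascii_letters else ch for ch in s[1:-1])
    return s[0] + middle + s[-1]


print(mask_letters("hello-world"))
print(mask_letters("ab"))
```

Without the check, an empty string would raise an IndexError on s[0], and a one-character string would be doubled because s[0] and s[-1] are the same character.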

You could define a function to not repeat yourself:
def replace(s, n):
    if len(s) > n*2:
        return s[:n] + '*'*(len(s)-n*2) + s[-n:]
    return s
print(replace('hello-world', 1))  # h*********d
print(replace('hello-world', 2))  # he*******ld
print(replace('hello', 2))  # he*lo
print(replace('hello', 3))  # hello
You can also use some kind of string formatting instead of concatenation (which should be more efficient), e.g. f-strings available in 3.6+:
def replace(s, n):
    if len(s) > n*2:
        return f"{ s[:n] }{ '*'*(len(s)-n*2) }{ s[-n:] }"
    return s

You can try this.
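s = "hello-world"
chars = list(s)
# Work by index: s.replace(i, "*") would also replace a matching first or last character
for k in range(1, len(chars) - 1):
    if chars[k].isalpha():
        chars[k] = "*"
s = "".join(chars)
print(s)  # h****-****d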

Python - making a function that would add "-" between letters

I'm trying to make a function, f(x), that would add a "-" between each letter:
For example:
f("James")
should output as:
J-a-m-e-s-
I would love it if you could use simple python functions as I am new to programming. Thanks in advance. Also, please use the "for" function because it is what I'm trying to learn.
Edit:
yes, I do want the "-" after the "s".

You could try it like this:
>>> def f(n):
...     return '-'.join(n)
...
>>> f('james')
'j-a-m-e-s'
>>>
I'm not sure whether you require the trailing hyphen.
Edit:
If you do want the trailing '-', you can do this:
def f(n):
    return '-'.join(n) + '-'
As a learner, it is important to understand that str.join(iterable) is the better way to concatenate more than two strings in Python, whereas the + operator is fine for appending one string to another.
Please read following posts to explore further:
Any reason not to use + to concatenate two strings?
which is better to concat string in python?
How slow is Python's string concatenation vs. str.join?

Also, please use the "for" function because it is what I'm trying to learn
>>> def f(s):
...     m = s[0]
...     for i in s[1:]:
...         m += '-' + i
...     return m
...
>>> f("James")
'J-a-m-e-s'
m = s[0]: the character at index 0 is assigned to the variable m.
for i in s[1:]: iterates from the second character onward.
m += '-' + i: appends '-' plus the character to m.
Finally, the value of m is returned.
If you want a '-' at the end as well, you could do it like this:
>>> def f(s):
...     m = ""
...     for i in s:
...         m += i + '-'
...     return m
...
>>> f("James")
'J-a-m-e-s-'

text_list = [c+"-" for c in text]
text_strung = "".join(text_list)

As a function, takes a string as input.
def dashify(input):
    output = ""
    for ch in input:
        output = output + ch + "-"
    return output

Given you asked for a solution that uses for and a final -, simply iterate over the message and add the character and '-' to an intermediate list, then join it up. This avoids the use of string concatenations:
>>> def f(message):
...     l = []
...     for c in message:
...         l.append(c)
...         l.append('-')
...     return "".join(l)
...
>>> print(f('James'))
J-a-m-e-s-

I'm sorry, but I just have to take Alexander Ravikovich's answer a step further:
f = lambda text: "".join([c+"-" for c in text])
print(f('James')) # J-a-m-e-s-
It is never too early to learn about list comprehension.
"".join(a_list) is self-explanatory: glueing elements of a list together with a string (empty string in this example).
lambda... well that's just a way to define a function in a line. Think
square = lambda x: x**2
square(2) # returns 4
square(3) # returns 9
Python is fun, it's not {enter-a-boring-programming-language-here}.
