Remove repeating characters from words - python

I was wondering what is the best way to convert something like "haaaaapppppyyy" to "haappyy".
Basically, when parsing slang, people sometimes repeat characters for added emphasis.
I was wondering what the best way to do this is? Using set() doesn't work because the order of the letters is obviously important.
Any ideas? I'm using Python + nltk.

It can be done using regular expressions:
>>> import re
>>> re.sub(r'(.)\1+', r'\1\1', "haaaaapppppyyy")
'haappyy'
(.)\1+ repleaces any character (.) followed by one or more of the same character (because of the backref \1 it must be the same) by twice the character.

You can squash multiple occurrences of letters with itertools.groupby:
>>> ''.join(c for c, _ in groupby("haaaaapppppyyy"))
'hapy'
Similarly, you can get haappyy from groupby with
>>> ''.join(''.join(s)[:2] for _, s in groupby("haaaaapppppyyy"))
'haappyy'

You should do it without reduce or regexps:
>>> s = 'hhaaaaapppppyyy'
>>> ''.join(['' if i>1 and e==s[i-2] else e for i,e in enumerate(s)])
'haappyy'
The number of repetitions are hardcoded to >1 and -2 above. The general case:
>>> reps = 1
>>> ''.join(['' if i>reps-1 and e==s[i-reps] else e for i,e in enumerate(s)])
'hapy'

This is one way of doing it (limited to the obvious constraint that python doesn't speak english).
>>> s="haaaappppyy"
>>> reduce(lambda x,y: x+y if x[-2:]!=y*2 else x, s, "")
'haappyy'

Related

How can i keep the dot in a string while removing letters from the alphabet

I have a string: lst = 'sbs1.23444nroen'
im using lst2 = ''.join(filter(str.isdigit, lst)) to remove all the letters so the result is: lst2 = '123444'
is there any way to include the "." so that the result would be '1.23444' without the letters but keeping the dot?
A more friendly to the eye solution and extendable if you want to include more characters.
s = 'sbs1.23444nroen'
toKeep = set('0123456789.')
s = ''.join(ch for ch in s if ch in toKeep)
print(s)
lst2 = ''.join(filter(lambda x: str.isdigit(x) or x=='.', lst))
An alternative solution would be to use a regular expression, although the set solution is the best so far.
>>> re.findall(r"\d+\.\d*", lst)
['1.23444']
With the added benefit of grabbing other groups of numbers as well:
>>> re.findall(r"\d+\.?\d*", "sbs1.23444nroe631n")
['1.23444', '631']

Split a unicode string into components containing numbers and letters

I'd like to split the string u'123K into 123 and K. I've tried re.match("u'123K", "\d+") to match the number and re.match("u'123K", "K") to match the letter but they don't work. What is a Pythonic way to do this?
Use re.findall() to find all numbers and characters:
>>> s = u'123K'
>>> re.findall(r'\d+|[a-zA-Z]+', s) # or use r'\d+|\D+' as mentioned in comment in order to match all numbers and non-numbers.
['123', 'K']
If you are just dealing with this string or if you only want to split the string from the last character you can simply use a indexing:
num, charracter = s[:-1], s[-1:]
You can also use itertools.groupby method, grouping digits:
>>> import itertools as it
>>> for _,v in it.groupby(s, key=str.isdigit):
print(''.join(v))
123
K

python regular expression, pulling all letters out

Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print num[0]
print num[len(num)-1]
output
A
F
note: the digits are of unknown length
You don't have to use regular expressions, or re at all. Assuming you want just letters to remain, you could do something like this:
a = "A13:F20"
a = filter(lambda x: x.isalpha(), a)
I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']
Use a simple list comprehension, as a filter and get only the alphabets from the actual string.
print [char for char in input_string if char.isalpha()]
# ['A', 'F']
You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>
If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print a[0], a[4]
More on python slicing in this answer:
Is there a way to substring a string in Python?

python: keep char only if it is within this list

i have a list:
a = ['a','b','c'.........'A','B','C'.........'Z']
and i have string:
string1= 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'
i want to keep ONLY those characters in string1 that exist in a
what is the most effecient way to do this? perhaps instead of having a be a list, i should just make it a string? like this a='abcdefg..........ABC..Z' ??
This should be faster.
>>> import re
>>> string1 = 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'
>>> a = ['E', 'i', 'W']
>>> r = re.compile('[^%s]+' % ''.join(a))
>>> print r.sub('', string1)
EiiWW
This is even faster than that.
>>> all_else = ''.join( chr(i) for i in range(256) if chr(i) not in set(a) )
>>> string1.translate(None, all_else)
'EiiWW'
44 microsec vs 13 microsec on my laptop.
How about that?
(Edit: turned out, translate yields the best performance.)
''.join([s for s in string1 if s in a])
Explanation:
[s for s in string1 if s in a]
creates a list of all characters in string1, but only if they are also in the list a.
''.join([...])
turns it back into a string by joining it with nothing ('') in between the elements of the given list.
List comprehension to the rescue!
wanted = ''.join(letter for letter in string1 if letter in a)
(Note that when passing a list comprehension to a function you can omit the brackets so that the full list isn't generated prior to being evaluated. While semantically the same as a list comprehension, this is called a generator expression.)
If, you are going to do this with large strings, there is a faster solution using translate; see this answer.
#katrielalex: To spell it out:
import string
string1= 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'
non_letters= ''.join(chr(i) for i in range(256) if chr(i) not in string.letters)
print string1.translate(None,non_letters)
print 'Simpler, but possibly less correct'
print string1.translate(None, string.punctuation+string.digits+string.whitespace)

How do I coalesce a sequence of identical characters into just one?

Suppose I have this:
My---sun--is------very-big---.
I want to replace all multiple hyphens with just one hyphen.
import re
astr='My---sun--is------very-big---.'
print(re.sub('-+','-',astr))
# My-sun-is-very-big-.
If you want to replace any run of consecutive characters, you can use
>>> import re
>>> a = "AA---BC++++DDDD-EE$$$$FF"
>>> print(re.sub(r"(.)\1+",r"\1",a))
A-BC+D-E$F
If you only want to coalesce non-word-characters, use
>>> print(re.sub(r"(\W)\1+",r"\1",a))
AA-BC+DDDD-EE$FF
If it's really just hyphens, I recommend unutbu's solution.
If you really only want to coalesce hyphens, use the other suggestions. Otherwise you can write your own function, something like this:
>>> def coalesce(x):
... n = []
... for c in x:
... if not n or c != n[-1]:
... n.append(c)
... return ''.join(n)
...
>>> coalesce('My---sun--is------very-big---.')
'My-sun-is-very-big-.'
>>> coalesce('aaabbbccc')
'abc'
As usual, there's a nice itertools solution, using groupby:
>>> from itertools import groupby
>>> s = 'aaaaa----bbb-----cccc----d-d-d'
>>> ''.join(key for key, group in groupby(s))
'a-b-c-d-d-d'
How about:
>>> import re
>>> re.sub("-+", "-", "My---sun--is------very-big---.")
'My-sun-is-very-big-.'
the regular expression "-+" will look for 1 or more "-".
re.sub('-+', '-', "My---sun--is------very-big---")
How about an alternate without the re module:
'-'.join(filter(lambda w: len(w) > 0, 'My---sun--is------very-big---.'.split("-")))
Or going with Tim and FogleBird's previous suggestion, here's a more general method:
def coalesce_factory(x):
return lambda sent: x.join(filter(lambda w: len(w) > 0, sent.split(x)))
hyphen_coalesce = coalesce_factory("-")
hyphen_coalesce('My---sun--is------very-big---.')
Though personally, I would use the re module first :)
mcpeterson
Another simple solution is the String object's replace function.
while '--' in astr:
astr = astr.replace('--','-')
if you don't want to use regular expressions:
my_string = my_string.split('-')
my_string = filter(None, my_string)
my_string = '-'.join(my_string)
I have
my_str = 'a, b,,,,, c, , , d'
I want
'a,b,c,d'
compress all the blanks (the "replace" bit), then split on the comma, then if not None join with a comma in between:
my_str_2 = ','.join([i for i in my_str.replace(" ", "").split(',') if i])

Categories

Resources