How do I coalesce a sequence of identical characters into just one?

Suppose I have this:
My---sun--is------very-big---.
I want to replace all multiple hyphens with just one hyphen.

import re
astr='My---sun--is------very-big---.'
print(re.sub('-+','-',astr))
# My-sun-is-very-big-.

If you want to replace any run of consecutive characters, you can use
>>> import re
>>> a = "AA---BC++++DDDD-EE$$$$FF"
>>> print(re.sub(r"(.)\1+",r"\1",a))
A-BC+D-E$F
If you only want to coalesce non-word-characters, use
>>> print(re.sub(r"(\W)\1+",r"\1",a))
AA-BC+DDDD-EE$FF
If it's really just hyphens, I recommend unutbu's solution.
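Both patterns above can be folded into one helper. This is only a sketch: the name `squeeze` and its `chars` parameter are made up for illustration and do not come from any of the answers.

```python
import re

def squeeze(text, chars=None):
    """Collapse runs of a repeated character into a single occurrence.

    chars=None collapses runs of any character; otherwise only the
    characters listed in the (non-empty) string `chars` are collapsed.
    """
    if chars is None:
        pattern = r"(.)\1+"
    else:
        # re.escape keeps characters such as '-' or ']' literal inside the class
        pattern = "([" + re.escape(chars) + r"])\1+"
    return re.sub(pattern, r"\1", text)

a = "AA---BC++++DDDD-EE$$$$FF"
print(squeeze(a))          # A-BC+D-E$F
print(squeeze(a, "-+"))    # AA-BC+DDDD-EE$$$$FF
```

Note that `.` does not match a newline by default, so runs of blank lines are left alone unless you pass `re.DOTALL` yourself.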

If you really only want to coalesce hyphens, use the other suggestions. Otherwise you can write your own function, something like this:
>>> def coalesce(x):
...     n = []
...     for c in x:
...         if not n or c != n[-1]:
...             n.append(c)
...     return ''.join(n)
...
>>> coalesce('My---sun--is------very-big---.')
'My-sun-is-very-big-.'
>>> coalesce('aaabbbccc')
'abc'

As usual, there's a nice itertools solution, using groupby:
>>> from itertools import groupby
>>> s = 'aaaaa----bbb-----cccc----d-d-d'
>>> ''.join(key for key, group in groupby(s))
'a-b-c-d-d-d'
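To see why this works, it may help to look at what groupby actually yields. A small sketch (the `runs` variable is just for illustration): each group is one run of equal characters, and joining only the keys throws away the repeats.

```python
from itertools import groupby

s = 'aaaaa----bbb-----cccc----d-d-d'

# Each group is one run of equal characters; materialise it to count its length.
runs = [(key, len(list(group))) for key, group in groupby(s)]
print(runs[:4])   # [('a', 5), ('-', 4), ('b', 3), ('-', 5)]

# Keeping only the keys drops every repeat.
print(''.join(key for key, _ in runs))   # a-b-c-d-d-d
```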

How about:
>>> import re
>>> re.sub("-+", "-", "My---sun--is------very-big---.")
'My-sun-is-very-big-.'
the regular expression "-+" will look for 1 or more "-".

re.sub('-+', '-', "My---sun--is------very-big---")

How about an alternate without the re module:
'-'.join(filter(lambda w: len(w) > 0, 'My---sun--is------very-big---.'.split("-")))
Or going with Tim and FogleBird's previous suggestion, here's a more general method:
def coalesce_factory(x):
    return lambda sent: x.join(filter(lambda w: len(w) > 0, sent.split(x)))
hyphen_coalesce = coalesce_factory("-")
hyphen_coalesce('My---sun--is------very-big---.')
Though personally, I would use the re module first :)
mcpeterson

Another simple solution is the String object's replace function.
while '--' in astr:
    astr = astr.replace('--','-')

if you don't want to use regular expressions:
my_string = my_string.split('-')
my_string = filter(None, my_string)
my_string = '-'.join(my_string)
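One caveat with the split/filter/join approach: it gives the same result as the regex for the string in the question, but not when the string starts or ends with hyphens, because the empty pieces at the edges are filtered out too. A quick sketch to compare the two (the function names are made up for this comparison):

```python
import re

def squeeze_regex(text):
    return re.sub('-+', '-', text)

def squeeze_split(text):
    # Empty pieces come from adjacent, leading, or trailing hyphens
    return '-'.join(filter(None, text.split('-')))

sample = '--a---b-'
print(squeeze_regex(sample))   # -a-b-
print(squeeze_split(sample))   # a-b
```

Which behaviour you want depends on whether leading and trailing hyphens are meaningful in your data.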

I have
my_str = 'a, b,,,,, c, , , d'
I want
'a,b,c,d'
Strip all the blanks (the "replace" bit), then split on the comma, then join the non-empty pieces with a comma in between:
my_str_2 = ','.join([i for i in my_str.replace(" ", "").split(',') if i])

Related

python regular expression split function issue

I'm using python2 and I want to get rid of these empty strings in the output of the following python regular expression:
import re
x = "010101000110100001100001"
print re.split("([0-1]{8})", x)
and the output is this :
['', '01010100', '', '01101000', '', '01100001', '']
I just want to get this output:
['01010100', '01101000', '01100001']

Regex probably isn't what you want to use in this case. It seems that you want to just split the string into groups of n (8) characters.
I poached an answer from this question.
def split_every(n, s):
    return [ s[i:i+n] for i in xrange(0, len(s), n) ]
split_every(8, "010101000110100001100001")
Out[2]: ['01010100', '01101000', '01100001']
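The version above uses xrange, which only exists in Python 2 (the question is about Python 2). If you are on Python 3, a sketch of the same idea with range would be:

```python
def split_every(n, s):
    """Return consecutive slices of s, each n characters long (the last may be shorter)."""
    return [s[i:i + n] for i in range(0, len(s), n)]

print(split_every(8, "010101000110100001100001"))
# ['01010100', '01101000', '01100001']
```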

One possible way:
print filter(None, re.split("([0-1]{8})", x))

import re
x = "010101000110100001100001"
l = re.split("([0-1]{8})", x)
l2 = [i for i in l if i]
out:
['01010100', '01101000', '01100001']

This is exactly what split is for: it splits the string using the regular expression as the separator.
If you need to find all matches, try findall instead:
import re
x = "010101000110100001100001"
print(re.findall("([0-1]{8})", x))
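To make the difference concrete, here is a small side-by-side sketch (Python 3 print syntax). The empty strings in the question's output are the pieces between the separators; the capturing group only adds the separators themselves to the list.

```python
import re

x = "010101000110100001100001"

# Without a group, the 8-digit matches are separators and are dropped,
# leaving only the empty strings around them.
print(re.split("[0-1]{8}", x))      # ['', '', '', '']

# With a capturing group, the separators are interleaved with those pieces.
print(re.split("([0-1]{8})", x))    # ['', '01010100', '', '01101000', '', '01100001', '']

# findall returns just the matches.
print(re.findall("[0-1]{8}", x))    # ['01010100', '01101000', '01100001']
```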

print([a for a in re.split("([0-1]{8})", x) if a != ''])

Following your regex approach, you can simply use a filter to get your desired output.
import re
x = "010101000110100001100001"
unfiltered_list = re.split("([0-1]{8})", x)
print filter(None, unfiltered_list)
If you run this, you should get:
['01010100', '01101000', '01100001']

python regular expression, pulling all letters out

Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print num[0]
print num[len(num)-1]
output
A
F
note: the digits are of unknown length

You don't have to use regular expressions, or re at all. Assuming you want just letters to remain, you could do something like this:
a = "A13:F20"
a = filter(lambda x: x.isalpha(), a)

I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']

Use a simple list comprehension as a filter to keep only the alphabetic characters from the string.
print [char for char in input_string if char.isalpha()]
# ['A', 'F']

You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>

If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print a[0], a[4]
More on python slicing in this answer:
Is there a way to substring a string in Python?

(Python) Splitting string only on single instance of delimiter

I'm trying to extract numeric values from text strings that use dashes as delimiters, but also to indicate negative values:
"1.3" # [1.3]
"1.3-2-3.9" # [1.3, 2, 3.9]
"1.3-2--3.9" # [1.3, 2, -3.9]
"-1.3-2--3.9" # [-1.3, 2, -3.9]
At the moment, I'm manually checking for the "--" sequence, but this seems really ugly and prone to breaking.
def get_values(text):
    return map(lambda s: s.replace('n', '-'), text.replace('--', '-n').split('-'))
I've tried a few different approaches, using both the str.split() function and re.findall(), but none of them have quite worked.
For example, the following pattern should match all the valid strings, but I'm not sure how to use it with findall:
r"^-?\d(\.\d*)?(--?\d(\.\d*)?)*$"
Is there a general way to do this that I'm not seeing? Thanks!

You can try to split with this pattern with a lookbehind:
(?<=[0-9])-
(A hyphen preceded by a digit)
>>> import re
>>> re.split('(?<=[0-9])-', text)
With this condition, the hyphen you split on is never at the start of the string or directly after another hyphen.
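Since the question asks for numeric values, here is a sketch that applies the lookbehind split and converts each piece with float. It assumes every number ends in a digit (so "1." followed by a hyphen would not be split) and that the input contains only well-formed numbers.

```python
import re

def get_values(text):
    # Split only on a hyphen that directly follows a digit; a hyphen at the
    # start of the string or after another hyphen stays attached as a sign.
    return [float(part) for part in re.split(r'(?<=[0-9])-', text)]

print(get_values("1.3-2--3.9"))    # [1.3, 2.0, -3.9]
print(get_values("-1.3-2--3.9"))   # [-1.3, 2.0, -3.9]
```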

@CasimiretHippolyte has given a very elegant Regex solution, but I would like to point out that you can do this pretty succinctly with just a list comprehension, iter, and next:
>>> def get_values(text):
...     it = iter(text.split("-"))
...     return [x or "-"+next(it) for x in it]
...
>>> get_values("1.3")
['1.3']
>>> get_values("1.3-2-3.9")
['1.3', '2', '3.9']
>>> get_values("1.3-2--3.9")
['1.3', '2', '-3.9']
>>> get_values("-1.3-2--3.9")
['-1.3', '2', '-3.9']
>>>
Also, if you use timeit.timeit, you will see that this solution is quite a bit faster than using Regex:
>>> from timeit import timeit
>>>
>>> # With Regex
>>> def get_values(text):
...     import re
...     return re.split('(?<=[0-9])-', text)
...
>>> timeit('get_values("-1.3-2--3.9")', 'from __main__ import get_values')
9.999720634885165
>>>
>>> # Without Regex
>>> def get_values(text):
...     it = iter(text.split("-"))
...     return [x or "-"+next(it) for x in it]
...
>>> timeit('get_values("-1.3-2--3.9")', 'from __main__ import get_values')
4.145546989910741
>>>

Python Regular expression repeat

I have a string like this
--x123-09827--x456-9908872--x789-267504
I am trying to get all value like
123:09827
456:9908872
789:267504
I've tried (--x([0-9]+)-([0-9])+)+
but it only gives me last pair result, I am testing it through python
>>> import re
>>> x = "--x123-09827--x456-9908872--x789-267504"
>>> p = "(--x([0-9]+)-([0-9]+))+"
>>> re.match(p,x)
>>> re.match(p,x).groups()
('--x789-267504', '789', '267504')
How should I write with nested repeat pattern?
Thanks a lot!
David

Code it like this:
x = "--x123-09827--x456-9908872--x789-267504"
p = "--x(?:[0-9]+)-(?:[0-9]+)"
print re.findall(p,x)

Just use the .findall method instead, it makes the expression simpler.
>>> import re
>>> x = "--x123-09827--x456-9908872--x789-267504"
>>> r = re.compile(r"--x(\d+)-(\d+)")
>>> r.findall(x)
[('123', '09827'), ('456', '9908872'), ('789', '267504')]
You can also use .finditer which might be helpful for longer strings.
>>> [m.groups() for m in r.finditer(x)]
[('123', '09827'), ('456', '9908872'), ('789', '267504')]
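To get from those tuples to the 123:09827 format in the question, something like the following sketch should do (the `pairs` and `lines` names are just for illustration):

```python
import re

x = "--x123-09827--x456-9908872--x789-267504"
pairs = re.findall(r"--x(\d+)-(\d+)", x)

# Each match is a (first, second) tuple of the two captured groups.
lines = [first + ":" + second for first, second in pairs]
print("\n".join(lines))
```

The original pattern only returned the last pair because a repeated group keeps just its final capture; findall sidesteps that by matching the unit once per occurrence.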

Use re.finditer or re.findall. Then you don't need the extra pair of parentheses that wrap the entire expression. For example,
>>> import re
>>> x = "--x123-09827--x456-9908872--x789-267504"
>>> p = "--x([0-9]+)-([0-9]+)"
>>> for m in re.finditer(p,x):
...     print '{0} {1}'.format(m.group(1),m.group(2))

try this
p='--x([0-9]+)-([0-9]+)'
re.findall(p,x)

No need to use regex :
>>> "--x123-09827--x456-9908872--x789-267504".replace('--x',' ').replace('-',':').strip()
'123:09827 456:9908872 789:267504'

You don't need regular expressions for this. Here is a simple one-liner, non-regex solution:
>>> input = "--x123-09827--x456-9908872--x789-267504"
>>> [ x.replace("-", ":") for x in input.split("--x")[1:] ]
['123:09827', '456:9908872', '789:267504']
If this is an exercise on regex, here is a solution that uses the repetition (technically), though the findall(...) solution may be preferred:
>>> import re
>>> input = "--x123-09827--x456-9908872--x789-267504"
>>> regex = '--x(.+)'
>>> [ x.replace("-", ":") for x in re.match(regex*3, input).groups() ]
['123:09827', '456:9908872', '789:267504']

Remove repeating characters from words

I was wondering what is the best way to convert something like "haaaaapppppyyy" to "haappyy".
Basically, when parsing slang, people sometimes repeat characters for added emphasis.
I was wondering what the best way to do this is? Using set() doesn't work because the order of the letters is obviously important.
Any ideas? I'm using Python + nltk.

It can be done using regular expressions:
>>> import re
>>> re.sub(r'(.)\1+', r'\1\1', "haaaaapppppyyy")
'haappyy'
(.)\1+ replaces any character (.) followed by one or more of the same character (the backreference \1 forces it to be the same) with two copies of that character.
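If the limit of two should be adjustable, the same idea can be parameterised. A sketch, assuming a hypothetical helper `limit_repeats` with a `max_run` argument of at least 1:

```python
import re

def limit_repeats(word, max_run=2):
    # (.)\1{n,} matches a character followed by at least n more copies,
    # i.e. a run longer than max_run; it is replaced by exactly max_run copies.
    pattern = r'(.)\1{%d,}' % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, word)

print(limit_repeats("haaaaapppppyyy"))      # haappyy
print(limit_repeats("haaaaapppppyyy", 1))   # hapy
```

Runs that are already at or below the limit are left untouched, so a normal word like "happy" comes back unchanged with the default limit.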

You can squash multiple occurrences of letters with itertools.groupby:
>>> from itertools import groupby
>>> ''.join(c for c, _ in groupby("haaaaapppppyyy"))
'hapy'
Similarly, you can get haappyy from groupby with
>>> ''.join(''.join(s)[:2] for _, s in groupby("haaaaapppppyyy"))
'haappyy'

You should do it without reduce or regexps:
>>> s = 'haaaaapppppyyy'
>>> ''.join(['' if i>1 and e==s[i-2] else e for i,e in enumerate(s)])
'haappyy'
The number of repetitions is hardcoded to >1 and -2 above. The general case:
>>> reps = 1
>>> ''.join(['' if i>reps-1 and e==s[i-reps] else e for i,e in enumerate(s)])
'hapy'
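One thing to watch with the index comparison: it compares each character with the one `reps` positions back, not with the run in between, so a word whose letters alternate can lose characters even though nothing is repeated consecutively. A sketch comparing it with a run-based version built on groupby and islice (both function names are made up for the comparison):

```python
from itertools import groupby, islice

def limit_by_index(s, reps=2):
    # The index-based comprehension from the answer above, wrapped in a function
    return ''.join('' if i > reps - 1 and e == s[i - reps] else e
                   for i, e in enumerate(s))

def limit_by_run(s, reps=2):
    # Keep at most `reps` characters from each run of equal characters
    return ''.join(''.join(islice(group, reps)) for _, group in groupby(s))

print(limit_by_index("haaaaapppppyyy"), limit_by_run("haaaaapppppyyy"))  # haappyy haappyy
print(limit_by_index("banana"), limit_by_run("banana"))                  # ban banana
```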

This is one way of doing it (limited to the obvious constraint that python doesn't speak english).
>>> from functools import reduce  # built in on Python 2, must be imported on Python 3
>>> s="haaaappppyy"
>>> reduce(lambda x,y: x+y if x[-2:]!=y*2 else x, s, "")
'haappyy'
