(Python) Splitting string only on single instance of delimiter - python

I'm trying to extract numeric values from text strings that use dashes as delimiters, but also to indicate negative values:
"1.3" # [1.3]
"1.3-2-3.9" # [1.3, 2, 3.9]
"1.3-2--3.9" # [1.3, 2, -3.9]
"-1.3-2--3.9" # [-1.3, 2, -3.9]
At the moment, I'm manually checking for the "--" sequence, but this seems really ugly and prone to breaking.
def get_values(text):
    return map(lambda s: s.replace('n', '-'), text.replace('--', '-n').split('-'))
I've tried a few different approaches, using both the str.split() function and re.findall(), but none of them have quite worked.
For example, the following pattern should match all the valid strings, but I'm not sure how to use it with findall:
r"^-?\d(\.\d*)?(--?\d(\.\d*)?)*$"
Is there a general way to do this that I'm not seeing? Thanks!

You can split on this pattern, which uses a lookbehind:
(?<=[0-9])-
(A hyphen preceded by a digit)
>>> import re
>>> re.split('(?<=[0-9])-', text)
With this condition, the hyphen you split on can never be at the start of the string or directly after another hyphen.
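As a minimal sketch of how the lookbehind split could be turned into the numeric lists shown in the question: the helper name parse_values is hypothetical, and it assumes every number ends in a digit (a value written as "1." would not be split off from a following hyphen).

```python
import re

def parse_values(text):
    """Split on hyphens that follow a digit, then convert each piece to float."""
    # (?<=[0-9])- matches only a hyphen directly preceded by a digit,
    # so a leading minus or the second hyphen of "--" stays with its number.
    pieces = re.split(r"(?<=[0-9])-", text)
    return [float(piece) for piece in pieces]

print(parse_values("1.3-2--3.9"))    # [1.3, 2.0, -3.9]
print(parse_values("-1.3-2--3.9"))   # [-1.3, 2.0, -3.9]
```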

@CasimiretHippolyte has given a very elegant regex solution, but I would like to point out that you can do this pretty succinctly with just a list comprehension, iter, and next:
>>> def get_values(text):
...     it = iter(text.split("-"))
...     return [x or "-"+next(it) for x in it]
...
>>> get_values("1.3")
['1.3']
>>> get_values("1.3-2-3.9")
['1.3', '2', '3.9']
>>> get_values("1.3-2--3.9")
['1.3', '2', '-3.9']
>>> get_values("-1.3-2--3.9")
['-1.3', '2', '-3.9']
>>>
Also, if you use timeit.timeit, you will see that this solution is quite a bit faster than the regex one (about 4.1 s versus 10.0 s for the default one million calls in the run below):
>>> from timeit import timeit
>>>
>>> # With Regex
>>> def get_values(text):
...     import re
...     return re.split('(?<=[0-9])-', text)
...
>>> timeit('get_values("-1.3-2--3.9")', 'from __main__ import get_values')
9.999720634885165
>>>
>>> # Without Regex
>>> def get_values(text):
...     it = iter(text.split("-"))
...     return [x or "-"+next(it) for x in it]
...
>>> timeit('get_values("-1.3-2--3.9")', 'from __main__ import get_values')
4.145546989910741
>>>
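For readers unpacking the iter/next comprehension in this answer, here is the same logic written as an explicit loop. It is only an illustrative sketch (the name get_values_loop is made up here); it relies on str.split producing an empty string wherever a hyphen sits at the start of the text or directly after another hyphen.

```python
def get_values_loop(text):
    """Loop form of the iter/next comprehension."""
    values = []
    pieces = iter(text.split("-"))
    for piece in pieces:
        if piece:
            values.append(piece)
        else:
            # An empty piece marks a minus sign: consume the next piece
            # from the same iterator and put the sign back in front of it.
            values.append("-" + next(pieces))
    return values

print(get_values_loop("-1.3-2--3.9"))  # ['-1.3', '2', '-3.9']
```

Note that, like the comprehension, this raises StopIteration if the text ends with a hyphen, because there is no piece left to attach the sign to.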

Related

python split and remove duplicates

I have the following output with print var:
test.qa.home-page.website.com-3412-jan
test.qa.home-page.website.net-5132-mar
test.qa.home-page.website.com-8422-aug
test.qa.home-page.website.net-9111-jan
I'm trying to find the correct split function to populate below:
test.qa.home-page.website.com
test.qa.home-page.website.net
test.qa.home-page.website.com
test.qa.home-page.website.net
...as well as remove duplicates:
test.qa.home-page.website.com
test.qa.home-page.website.net
The numeric values after "com-" or "net-" are random, so I think my struggle is finding out how to do something like rsplit("-" + [CHECK_FOR_ANY_NUMBER])[0]. Any suggestions would be great, thanks in advance!
How about:
import re
output = [
"test.qa.home-page.website.com-3412-jan",
"test.qa.home-page.website.net-5132-mar",
"test.qa.home-page.website.com-8422-aug",
"test.qa.home-page.website.net-9111-jan"
]
trimmed = set([re.split("-[0-9]", item)[0] for item in output])
print(trimmed)
# out : {'test.qa.home-page.website.net', 'test.qa.home-page.website.com'}
If you have an array of values, and you want to remove duplicates, you can use set.
>>> l = [1,2,3,1,2,3]
>>> l
[1, 2, 3, 1, 2, 3]
>>> set(l)
{1, 2, 3}
You can get a useful list by applying s.rsplit('-', 2)[0] to every value (a plain s.split('-')[0] would stop at the hyphen in "home-page").
You could use a regex to parse the individual lines and a set comprehension to uniqueify:
txt='''\
test.qa.home-page.website.com-3412-jan
test.qa.home-page.website.net-5132-mar
test.qa.home-page.website.com-8422-aug
test.qa.home-page.website.net-9111-jan'''
import re
>>> {re.sub(r'^(.*\.(?:com|net)).*', r'\1', s) for s in txt.split() }
{'test.qa.home-page.website.net', 'test.qa.home-page.website.com'}
Or just use the same regex with set and re.findall with the re.M flag:
>>> set(re.findall(r'^(.*\.(?:com|net))', txt, flags=re.M))
{'test.qa.home-page.website.net', 'test.qa.home-page.website.com'}
If you want to maintain order, use {}.fromkeys() (dicts keep insertion order in CPython 3.6 and, as a language guarantee, from Python 3.7):
>>> list({}.fromkeys(re.findall(r'^(.*\.(?:com|net))', txt, flags=re.M)).keys())
['test.qa.home-page.website.com', 'test.qa.home-page.website.net']
Or, if you know the part you want to drop is always the last two hyphen-separated fields, just use .rsplit() with maxsplit=2:
>>> {s.rsplit('-',maxsplit=2)[0] for s in txt.splitlines()}
{'test.qa.home-page.website.com', 'test.qa.home-page.website.net'}
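The rsplit and fromkeys ideas above can be combined into one small helper. This is a sketch rather than a definitive version: the name unique_prefixes is hypothetical, and it assumes every line ends in exactly two hyphen-separated fields that should be dropped.

```python
def unique_prefixes(lines):
    """Drop the last two hyphen-separated fields and de-duplicate, keeping order."""
    # rsplit with maxsplit=2 cuts only at the last two hyphens, so the
    # hyphen inside "home-page" is left alone.
    prefixes = (line.rsplit("-", 2)[0] for line in lines)
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys(prefixes))

lines = [
    "test.qa.home-page.website.com-3412-jan",
    "test.qa.home-page.website.net-5132-mar",
    "test.qa.home-page.website.com-8422-aug",
    "test.qa.home-page.website.net-9111-jan",
]
print(unique_prefixes(lines))
# ['test.qa.home-page.website.com', 'test.qa.home-page.website.net']
```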

Substring[whole word] check using a string variable

In Python2.7, I am trying the following:
>>> import re
>>> text='0.0.0.0/0 172.36.128.214'
>>> far_end_ip="172.36.128.214"
>>>
>>>
>>> chk=re.search(r"\b172.36.128.214\b",text)
>>> chk
<_sre.SRE_Match object at 0x0000000002349578>
>>> chk=re.search(r"\b172.36.128.21\b",text)
>>> chk
>>> chk=re.search(r"\b"+far_end_ip+"\b",text)
>>>
>>> chk
>>>
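Q: How can I make the search work when using the variable far_end_ip?
Two issues:
You need to write the last part of the pattern as a raw string literal or escape the backslash, because "\b" in an ordinary string is a backspace character: ... + r"\b"
You should escape the dots in the text you are searching for, since an unescaped . matches any character: ... + re.escape(far_end_ip)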
So:
re.search(r"\b" + re.escape(far_end_ip) + r"\b",text)
See also "How to use a variable inside a regular expression?".
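The two fixes can be wrapped in a small helper, sketched below. The function name contains_whole is made up for illustration, and it assumes the search text starts and ends with a word character (as an IP address does), since \b only marks a boundary between a word and a non-word character.

```python
import re

def contains_whole(needle, text):
    """Return True if needle occurs in text delimited by word boundaries."""
    # re.escape makes the dots literal; the raw strings keep \b as a
    # word-boundary assertion instead of a backspace character.
    pattern = r"\b" + re.escape(needle) + r"\b"
    return re.search(pattern, text) is not None

text = "0.0.0.0/0 172.36.128.214"
print(contains_whole("172.36.128.214", text))  # True
print(contains_whole("172.36.128.21", text))   # False
print(len("\b"), len(r"\b"))                   # 1 2
```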

Split string into tuple (Upper,lower) 'ABCDefgh' . Python 2.7.6

my_string = 'ABCDefgh'
desired = ('ABCD','efgh')
The only way I can think of is to loop over the string, check each character individually, build up the two strings, and then create the tuple. Is there a more efficient way to do this?
It will always be in the format UPPERlower.
import re
print(tuple(re.split("([A-Z]+)", my_string)[1:]))
Simple way (two passes):
>>> import itertools
>>> my_string = 'ABCDefgh'
>>> desired = (''.join(itertools.takewhile(lambda c:c.isupper(), my_string)), ''.join(itertools.dropwhile(lambda c:c.isupper(), my_string)))
>>> desired
('ABCD', 'efgh')
Efficient way (one pass):
>>> my_string = 'ABCDefgh'
>>> uppers = []
>>> done = False
>>> i = 0
>>> while not done and i < len(my_string):
...     c = my_string[i]
...     if c.isupper():
...         uppers.append(c)
...         i += 1
...     else:
...         done = True
...
>>> lowers = my_string[i:]
>>> desired = (''.join(uppers), lowers)
>>> desired
('ABCD', 'efgh')
Because I throw itertools.groupby at everything:
>>> my_string = 'ABCDefgh'
>>> from itertools import groupby
>>> [''.join(g) for k,g in groupby(my_string, str.isupper)]
['ABCD', 'efgh']
(A little overpowered here, but scales up to more complicated problems nicely.)
import re
my_string = 'ABCDefg'
desired = (re.search('[A-Z]+', my_string).group(0), re.search('[a-z]+', my_string).group(0))
print(desired)
A more robust approach without using re:
>>> import string
>>> txt = "ABCeUiioualfjNLkdD"
>>> tup = (''.join([char for char in txt if char in string.ascii_uppercase]),
...        ''.join([char for char in txt if char not in string.ascii_uppercase]))
>>> tup
('ABCUNLD', 'eiioualfjkd')
Using char not in string.ascii_uppercase instead of char in string.ascii_lowercase means that you never lose data if your string has non-letters in it. That can save you from tracking down an error that only surfaces 20 function calls later, when the truncated input is finally rejected.
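For the strict UPPERlower format stated in the question, one more option is a single anchored regex with two groups. This is only a sketch under assumptions: the name split_upper_lower is hypothetical, it handles ASCII letters only, and it uses re.fullmatch, which requires Python 3.4 or later.

```python
import re

def split_upper_lower(s):
    """Return (upper_prefix, lower_suffix), or None if s is not UPPERlower."""
    # fullmatch requires the whole string to be a run of uppercase
    # letters followed by a run of lowercase letters.
    match = re.fullmatch(r"([A-Z]*)([a-z]*)", s)
    return match.groups() if match else None

print(split_upper_lower("ABCDefgh"))  # ('ABCD', 'efgh')
print(split_upper_lower("ABcdEF"))    # None
```

Returning None for malformed input makes the failure visible at the point of parsing rather than later.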

Python Regular expression repeat

I have a string like this
--x123-09827--x456-9908872--x789-267504
I am trying to get all the values, like:
123:09827
456:9908872
789:267504
I've tried (--x([0-9]+)-([0-9]+))+
but it only gives me the last pair. I am testing it in Python:
>>> import re
>>> x = "--x123-09827--x456-9908872--x789-267504"
>>> p = "(--x([0-9]+)-([0-9]+))+"
>>> re.match(p,x)
>>> re.match(p,x).groups()
('--x789-267504', '789', '267504')
How should I write a nested repeating pattern?
Thanks a lot!
David
Code it like this:
import re
x = "--x123-09827--x456-9908872--x789-267504"
p = "--x(?:[0-9]+)-(?:[0-9]+)"
print(re.findall(p, x))
Just use the .findall method instead; it makes the expression simpler.
>>> import re
>>> x = "--x123-09827--x456-9908872--x789-267504"
>>> r = re.compile(r"--x(\d+)-(\d+)")
>>> r.findall(x)
[('123', '09827'), ('456', '9908872'), ('789', '267504')]
You can also use .finditer which might be helpful for longer strings.
>>> [m.groups() for m in r.finditer(x)]
[('123', '09827'), ('456', '9908872'), ('789', '267504')]
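The original attempt returned only the last pair because a capturing group that is repeated keeps just its final match. Dropping the outer repetition and letting findall iterate avoids that. A short sketch of going from the tuples to the key:value strings the question asks for (the names PAIR and format_pairs are made up for illustration):

```python
import re

# One pair per match; findall returns a (key, value) tuple for each.
PAIR = re.compile(r"--x(\d+)-(\d+)")

def format_pairs(text):
    """Return each --xKEY-VALUE pair in text as a 'KEY:VALUE' string."""
    return ["{}:{}".format(key, value) for key, value in PAIR.findall(text)]

x = "--x123-09827--x456-9908872--x789-267504"
print(format_pairs(x))  # ['123:09827', '456:9908872', '789:267504']
```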
Use re.finditer or re.findall. Then you don't need the extra pair of parentheses that wrap the entire expression. For example,
>>> import re
>>> x = "--x123-09827--x456-9908872--x789-267504"
>>> p = "--x([0-9]+)-([0-9]+)"
>>> for m in re.finditer(p,x):
...     print('{0} {1}'.format(m.group(1),m.group(2)))
Try this:
p='--x([0-9]+)-([0-9]+)'
re.findall(p,x)
No need to use regex:
>>> "--x123-09827--x456-9908872--x789-267504".replace('--x',' ').replace('-',':').strip()
'123:09827 456:9908872 789:267504'
You don't need regular expressions for this. Here is a simple one-liner, non-regex solution:
>>> input = "--x123-09827--x456-9908872--x789-267504"
>>> [ x.replace("-", ":") for x in input.split("--x")[1:] ]
['123:09827', '456:9908872', '789:267504']
If this is an exercise on regex, here is a solution that uses the repetition (technically), though the findall(...) solution may be preferred:
>>> import re
>>> input = "--x123-09827--x456-9908872--x789-267504"
>>> regex = '--x(.+)'
>>> [ x.replace("-", ":") for x in re.match(regex*3, input).groups() ]
['123:09827', '456:9908872', '789:267504']

How do I coalesce a sequence of identical characters into just one?

Suppose I have this:
My---sun--is------very-big---.
I want to replace all multiple hyphens with just one hyphen.
import re
astr='My---sun--is------very-big---.'
print(re.sub('-+','-',astr))
# My-sun-is-very-big-.
If you want to replace any run of consecutive characters, you can use
>>> import re
>>> a = "AA---BC++++DDDD-EE$$$$FF"
>>> print(re.sub(r"(.)\1+",r"\1",a))
A-BC+D-E$F
If you only want to coalesce non-word-characters, use
>>> print(re.sub(r"(\W)\1+",r"\1",a))
AA-BC+DDDD-EE$FF
If it's really just hyphens, I recommend unutbu's solution.
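Between "any character" and "any non-word character" there is a middle ground: coalescing only a chosen set of characters. A possible sketch, where the helper name squeeze and its chars parameter are assumptions made for illustration (chars is assumed to be non-empty):

```python
import re

def squeeze(text, chars):
    """Collapse runs of any character listed in chars (assumed non-empty)."""
    # re.escape keeps characters such as "-" or "]" literal inside the class;
    # the backreference \1 requires the run to repeat the same character.
    pattern = "([" + re.escape(chars) + r"])\1+"
    return re.sub(pattern, r"\1", text)

a = "AA---BC++++DDDD-EE$$$$FF"
print(squeeze(a, "-+"))  # AA-BC+DDDD-EE$$$$FF
```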
If you really only want to coalesce hyphens, use the other suggestions. Otherwise you can write your own function, something like this:
>>> def coalesce(x):
...     n = []
...     for c in x:
...         if not n or c != n[-1]:
...             n.append(c)
...     return ''.join(n)
...
>>> coalesce('My---sun--is------very-big---.')
'My-sun-is-very-big-.'
>>> coalesce('aaabbbccc')
'abc'
As usual, there's a nice itertools solution, using groupby:
>>> from itertools import groupby
>>> s = 'aaaaa----bbb-----cccc----d-d-d'
>>> ''.join(key for key, group in groupby(s))
'a-b-c-d-d-d'
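The groupby one-liner collapses every run, letters included. If only one character should be coalesced, the groups can be inspected individually. A minimal sketch, with coalesce_char and its target parameter as hypothetical names:

```python
from itertools import groupby

def coalesce_char(text, target="-"):
    """Collapse runs of target to a single character; leave other runs intact."""
    parts = []
    for key, group in groupby(text):
        run = "".join(group)
        # Keep one copy of the target character, the full run otherwise.
        parts.append(key if key == target else run)
    return "".join(parts)

s = 'aaaaa----bbb-----cccc----d-d-d'
print(coalesce_char(s))  # aaaaa-bbb-cccc-d-d-d
```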
How about:
>>> import re
>>> re.sub("-+", "-", "My---sun--is------very-big---.")
'My-sun-is-very-big-.'
The regular expression "-+" matches one or more consecutive "-" characters.
re.sub('-+', '-', "My---sun--is------very-big---")
How about an alternate without the re module:
'-'.join(filter(lambda w: len(w) > 0, 'My---sun--is------very-big---.'.split("-")))
Or going with Tim and FogleBird's previous suggestion, here's a more general method:
def coalesce_factory(x):
    return lambda sent: x.join(filter(lambda w: len(w) > 0, sent.split(x)))
hyphen_coalesce = coalesce_factory("-")
hyphen_coalesce('My---sun--is------very-big---.')
Though personally, I would use the re module first :)
mcpeterson
Another simple solution is the String object's replace function.
while '--' in astr:
    astr = astr.replace('--','-')
If you don't want to use regular expressions:
my_string = my_string.split('-')
my_string = filter(None, my_string)
my_string = '-'.join(my_string)
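One behavioural difference between the approaches in this thread is worth checking before picking one: the split/filter/join variants drop leading and trailing hyphens entirely, while re.sub keeps a single hyphen at each edge. The sketch below puts the two side by side; the function names are made up for the comparison.

```python
import re

def squeeze_keep_edges(s):
    """Regex version: every run of hyphens becomes one hyphen."""
    return re.sub("-+", "-", s)

def squeeze_drop_edges(s):
    """Split/filter/join version: empty pieces vanish, including at the edges."""
    return "-".join(filter(None, s.split("-")))

s = "--My---sun--is---"
print(squeeze_keep_edges(s))  # -My-sun-is-
print(squeeze_drop_edges(s))  # My-sun-is
```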
I have
my_str = 'a, b,,,,, c, , , d'
I want
'a,b,c,d'
Remove all the blanks (the replace call), split on the comma, then join the non-empty pieces with a comma in between:
my_str_2 = ','.join([i for i in my_str.replace(" ", "").split(',') if i])
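Removing every blank also removes spaces inside a field. If fields may contain internal spaces, a variant that strips each field instead could look like the following sketch (compact_csv is a hypothetical name, and the second input is an invented example):

```python
def compact_csv(text):
    """Trim each comma-separated field and drop the empty ones."""
    fields = (field.strip() for field in text.split(","))
    return ",".join(field for field in fields if field)

print(compact_csv("a, b,,,,, c, , , d"))      # a,b,c,d
print(compact_csv("new york, ,los angeles"))  # new york,los angeles
```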
