How to split a string containing digits and characters

I want to split a long string (containing digits and letters, with no spaces) into separate substrings in Python.
>>> s = "abc123cde4567"
After the split I should get:
['abc', '123', 'cde', '4567']
Thank you!

>>> import re
>>> re.findall("[a-z]+|[0-9]+", "abc123cde4567")
['abc', '123', 'cde', '4567']
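Note that the pattern above only covers lowercase ASCII letters. Here is a minimal sketch, assuming the input may contain uppercase letters too (the sample string is made up for illustration), of what happens then and one possible way to handle it with re.IGNORECASE:

```python
import re

# Hypothetical mixed-case input, not from the original question.
s = "Abc123CDE4567"

# With [a-z] alone, uppercase letters match neither alternative and are skipped.
lower_only = re.findall(r"[a-z]+|[0-9]+", s)

# re.IGNORECASE makes [a-z] match uppercase letters as well.
any_case = re.findall(r"[a-z]+|[0-9]+", s, flags=re.IGNORECASE)

print(lower_only)  # ['bc', '123', '4567']
print(any_case)    # ['Abc', '123', 'CDE', '4567']
```

findall does not complain about characters it cannot match; it just leaves them out, so it is worth checking the character classes against the real data.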

Something different from a regex:
from itertools import groupby
from string import digits
s = "abc123cde4567"
print([''.join(g) for k, g in groupby(s, digits.__contains__)])
# ['abc', '123', 'cde', '4567']

Related

Trailing empty string after re.split()

I have two strings where I want to isolate sequences of digits from everything else.
For example:
import re
s = 'abc123abc'
print(re.split(r'(\d+)', s))
s = 'abc123abc123'
print(re.split(r'(\d+)', s))
The output looks like this:
['abc', '123', 'abc']
['abc', '123', 'abc', '123', '']
Note that in the second case, there's a trailing empty string.
Obviously I can test for that and remove it if necessary but it seems cumbersome and I wondered if the RE can be improved to account for this scenario.

You can use filter to drop the empty string:
>>> s = 'abc123abc123'
>>> re.split(r'(\d+)', s)
['abc', '123', 'abc', '123', '']
>>> list(filter(None, re.split(r'(\d+)', s)))
['abc', '123', 'abc', '123']
Thanks to @chepner, you can also use a list comprehension:
>>> [x for x in re.split(r'(\d+)', s) if x]
['abc', '123', 'abc', '123']
The same approach works if the string contains symbols or other characters:
>>> s = '&^%123abc123$##123'
>>> list(filter(None, re.split(r'(\d+)', s)))
['&^%', '123', 'abc', '123', '$##', '123']

This has to do with the implementation of re.split() itself: you can't change it. When the function splits, it doesn't check anything that comes after the capture group, so it can't choose for you to either keep or discard the empty string that is left after splitting. It just splits there and leaves the rest of the string (which can be empty) to the next cycle.
If you don't want that empty string, you can get rid of it in various ways before collecting the results into a list. user1740577's is one example, but personally I prefer a list comprehension, since it's more idiomatic for simple filter/map operations:
parts = [part for part in re.split(r'(\d+)', s) if part]
I recommend against checking and getting rid of the element after the list has already been created, because it involves more operations and allocations.
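To make the behaviour described above concrete, here is a small sketch (the helper name split_digits is my own, not from the question). It shows that re.split reports an empty string at either end of the input whenever a match touches that end, and that the list comprehension removes both:

```python
import re

def split_digits(s):
    """Split s into digit and non-digit runs, dropping empty strings."""
    return [part for part in re.split(r"(\d+)", s) if part]

# re.split always reports the text before and after each match, even if empty.
print(re.split(r"(\d+)", "123abc123"))  # ['', '123', 'abc', '123', '']
print(split_digits("123abc123"))        # ['123', 'abc', '123']
```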

A simple way to use regular expressions for this would be re.findall:
import re
def bits(s):
    return re.findall(r"(\D+|\d+)", s)
bits("abc123abc123")
# ['abc', '123', 'abc', '123']
But it seems easier and more natural with itertools.groupby. After all, you are chunking an iterable based on a single condition:
from itertools import groupby
def bits(s):
    return ["".join(g) for _, g in groupby(s, key=str.isdigit)]
bits("abc123abc123")
# ['abc', '123', 'abc', '123']
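One thing groupby gives you that findall does not is the key for each chunk. As a rough sketch (typed_bits is a hypothetical name, and I am assuming the digits are plain ASCII 0-9, since str.isdigit also accepts characters such as superscripts that int() rejects), the key can be used to convert the digit runs to integers in the same pass:

```python
from itertools import groupby

def typed_bits(s):
    """Like bits(), but converts each run of digits to an int."""
    return [
        # is_digit is the key value shared by every character in the group.
        int("".join(group)) if is_digit else "".join(group)
        for is_digit, group in groupby(s, key=str.isdigit)
    ]

print(typed_bits("abc123abc123"))  # ['abc', 123, 'abc', 123]
```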

Python string to list?

I'm trying to convert string to a list
str = "ab(1234)bcta(45am)in23i(ab78lk)"
Expected Output
res_str = ["ab","bcta","in23i"]
I tried removing brackets from str.
re.sub(r'\([^)]*\)', '', str)

You may use a negated character class with a lookahead:
>>> s = "ab(1234)bcta(45am)in23i(ab78lk)"
>>> print (re.findall(r'[^()]+(?=\()', s))
['ab', 'bcta', 'in23i']
RegEx Details:
[^()]+: Match 1 or more of any character that is not ( or )
(?=\(): Lookahead to assert that there is a ( ahead
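To see what the lookahead contributes, here is a short comparison. The input is the string from the question with a made-up "tail" appended, so that there is text with no ( after it:

```python
import re

# Hypothetical input: the question's string plus trailing text with no "(" after it.
s = "ab(1234)bcta(45am)in23i(ab78lk)tail"

# Only runs that are immediately followed by "(" are returned.
with_lookahead = re.findall(r"[^()]+(?=\()", s)

# Without the lookahead, every run between parentheses is returned.
without_lookahead = re.findall(r"[^()]+", s)

print(with_lookahead)     # ['ab', 'bcta', 'in23i']
print(without_lookahead)  # ['ab', '1234', 'bcta', '45am', 'in23i', 'ab78lk', 'tail']
```

So the lookahead is what excludes both the text inside the parentheses and any trailing text.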

So many options here. One possibility would be using split:
import re
str = "ab(1234)bcta(45am)in23i(ab78lk)"
print(re.split(r'\(.*?\)', str)[:-1])
Returns:
['ab', 'bcta', 'in23i']
A second option would be to split on all parentheses and slice the resulting list:
import re
str = "ab(1234)bcta(45am)in23i(ab78lk)"
print(re.split('[()]', str)[0:-1:2])
Where [0:-1:2] means to start at index 0, stop before the last index, and take every second element.
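Printing the intermediate list may make the slice easier to follow. This is just the same split with the unsliced result shown, plus the complementary slice for the text inside the parentheses:

```python
import re

s = "ab(1234)bcta(45am)in23i(ab78lk)"
parts = re.split(r"[()]", s)

# Even indices hold text outside parentheses, odd indices hold text inside.
print(parts)          # ['ab', '1234', 'bcta', '45am', 'in23i', 'ab78lk', '']
print(parts[0:-1:2])  # ['ab', 'bcta', 'in23i']
print(parts[1::2])    # ['1234', '45am', 'ab78lk']
```

The trailing empty string is there because the input ends with ), which is why the slice stops before the last index.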

Use re.split
import re
str = "ab(1234)bcta(45am)in23i(ab78lk)"
print(re.split(r'\(.*?\)', str))
Returns:
['ab', 'bcta', 'in23i', '']
If you want to get rid of empty strings in your list, you may use a filter:
print(list(filter(None, re.split(r'\(.*?\)', str))))
Returns:
['ab', 'bcta', 'in23i']

You may match all alphanumeric characters followed by a (:
>>> re.findall(r'\w+(?=\()', str)
['ab', 'bcta', 'in23i']
or using re.sub as you were:
>>> re.sub(r'\([^)]+\)', ' ', str).split()
['ab', 'bcta', 'in23i']

Just for the sake of complexity:
>>> str = "ab(1234)bcta(45am)in23i(ab78lk)"
>>> res_str = [y[-1] for y in [x.split(')') for x in str.split('(')]][0:-1]
>>> res_str
['ab', 'bcta', 'in23i']

Extract text from parenthesis

How can I extract the text enclosed within the square brackets from the following string:
string = '{a=[], b=[abc, def], c=[ghi], d=[], e=[jkl], f=[mno, pqr, stu, vwx]}'
Expected Output is:
['abc','def','ghi','jkl','mno','pqr','stu','vwx']

Regex should help.
import re
string = '{a=[], b=[abc, def], c=[ghi], d=[], e=[jkl], f=[mno, pqr, stu, vwx]}'
res = []
for i in re.findall(r"\[(.*?)\]", string):
    res.extend(i.replace(",", "").split())
print(res)
Output:
['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr', 'stu', 'vwx']

An alternative using the third-party regex module (which supports \G) could be:
(?:\G(?!\A)|\[)([^][,]+)(?:,\s*)?
Broken down, this says:
(?:\G(?!\A)|\[) # match either [ or at the end of the last match
([^][,]+) # capture anything not [ or ] or ,
(?:,\s*)? # optionally followed by , and whitespace
See a demo on regex101.com.
In Python:
import regex as re
string = '{a=[], b=[abc, def], c=[ghi], d=[], e=[jkl], f=[mno, pqr, stu, vwx]}'
rx = re.compile(r'(?:\G(?!\A)|\[)([^][,]+)(?:,\s*)?')
output = rx.findall(string)
print(output)
# ['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr', 'stu', 'vwx']
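If installing the third-party regex module is not an option, something similar can be done with the standard re module in two steps. This is only a sketch (bracket_items is a name I made up), and it assumes the brackets are never nested:

```python
import re

def bracket_items(text):
    """Return the comma-separated items found inside each [...] group."""
    items = []
    for group in re.findall(r"\[(.*?)\]", text):
        # Split on commas and strip whitespace; empty groups yield nothing.
        items.extend(part.strip() for part in group.split(",") if part.strip())
    return items

string = '{a=[], b=[abc, def], c=[ghi], d=[], e=[jkl], f=[mno, pqr, stu, vwx]}'
print(bracket_items(string))
# ['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr', 'stu', 'vwx']
```

Splitting on the comma rather than on whitespace keeps items that contain spaces in one piece.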

python how to split string with more than one character?

I would like to split a string as below
1234ABC into 1234 and ABC
2B into 2 and B
10E into 10 and E
I found that the split function does not work because there is no delimiter.

You can use itertools.groupby with str.isdigit as the key function.
from itertools import groupby
test1 = '123ABC'
test2 = '2B'
test3 = '10E'
def custom_split(s):
    return [''.join(gp) for _, gp in groupby(s, lambda char: char.isdigit())]
for t in [test1, test2, test3]:
    print(custom_split(t))
# ['123', 'ABC']
# ['2', 'B']
# ['10', 'E']

This can quite easily be accomplished using the re module:
>>> import re
>>>
>>> re.findall('[a-zA-Z]+|[0-9]+', '1234ABC')
['1234', 'ABC']
>>> re.findall('[a-zA-Z]+|[0-9]+', '2B')
['2', 'B']
>>> re.findall('[a-zA-Z]+|[0-9]+', '10E')
['10', 'E']
>>> # additional test case
...
>>> re.findall('[a-zA-Z]+|[0-9]+', 'abcd1234efgh5678')
['abcd', '1234', 'efgh', '5678']
>>>
The regex is very simple. Here is a quick walkthrough:
[a-zA-Z]+: Match one or more alphabetic characters, lower or upper case
|: or...
[0-9]+: Match one or more digits
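One caveat with this pattern: anything that is neither a letter nor a digit is silently dropped. A quick sketch with a made-up input, plus a possible third alternative (my addition, not part of the answer above) if those characters should be kept:

```python
import re

pattern = re.compile(r"[a-zA-Z]+|[0-9]+")

# Hypothetical input containing separators that match neither alternative.
print(pattern.findall("10E-2b_7"))  # ['10', 'E', '2', 'b', '7']

# Adding a third alternative keeps everything else as its own chunk.
keep_all = re.compile(r"[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9]+")
print(keep_all.findall("10E-2b_7"))  # ['10', 'E', '-', '2', 'b', '_', '7']
```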

Another way to solve it, using the re package:
import re
test_string = '1234ABC'
r = re.search('([0-9]*)([a-zA-Z]*)', test_string)
print(r.groups())  # ('1234', 'ABC')
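Because both groups use *, that pattern matches any string at all; for 'ABC12' it returns ('', 'ABC'). If you would rather reject input that is not digits followed by letters, a stricter variant could use re.fullmatch with +. This is a sketch only, and split_number_unit is a hypothetical name:

```python
import re

def split_number_unit(s):
    """Return (digits, letters) if s is digits followed by letters, else None."""
    m = re.fullmatch(r"([0-9]+)([a-zA-Z]+)", s)
    return m.groups() if m else None

print(split_number_unit("1234ABC"))  # ('1234', 'ABC')
print(split_number_unit("2B"))       # ('2', 'B')
print(split_number_unit("ABC12"))    # None
```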

Search a list of strings with a list of substrings

I have a list of strings and currently I can search for one substring at a time:
str = ['abc', 'efg', 'xyz']
[s for s in str if "a" in s]
which correctly returns
['abc']
Now let's say I have a list of substrings instead:
subs = ['a', 'ef']
I want a command like
[s for s in str if anyof(subs) in s]
which should return
['abc', 'efg']

>>> s = ['abc', 'efg', 'xyz']
>>> subs = ['a', 'ef']
>>> [x for x in s if any(sub in x for sub in subs)]
['abc', 'efg']
Don't use str as a variable name, it's a builtin.
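If the list of substrings is long, one alternative (not necessarily faster, so measure before relying on it) is to build a single regex from the substrings. A minimal sketch, assuming subs is non-empty; the variable names are mine:

```python
import re

strings = ['abc', 'efg', 'xyz']
subs = ['a', 'ef']

# re.escape guards against substrings containing regex metacharacters.
pattern = re.compile("|".join(map(re.escape, subs)))
matches = [x for x in strings if pattern.search(x)]

print(matches)  # ['abc', 'efg']
```

With an empty subs the joined pattern is the empty string, which matches everything, whereas any() over an empty list matches nothing.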

Gets a little convoluted but you could do
[s for s in str if any([sub for sub in subs if sub in s])]

Simply use them one after the other:
[s for s in str for r in subs if r in s]
>>> r = ['abc', 'efg', 'xyz']
>>> s = ['a', 'ef']
>>> [t for t in r for x in s if x in t]
['abc', 'efg']
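Be aware that the nested for version yields a string once per matching substring, so you can get duplicates. A small illustration with a made-up subs list in which both entries occur in 'abc':

```python
strings = ['abc', 'efg', 'xyz']
subs = ['a', 'b']  # hypothetical: both occur in 'abc'

# One result per (string, substring) pair that matches.
nested = [t for t in strings for x in subs if x in t]

# any() stops at the first hit, so each string appears at most once.
with_any = [t for t in strings if any(x in t for x in subs)]

print(nested)    # ['abc', 'abc']
print(with_any)  # ['abc']
```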

I still like map and filter, despite the arguments against them and the fact that a comprehension can always replace them. Hence, here is a map + filter + lambda version:
print(list(filter(lambda x: any(map(x.__contains__, subs)), s)))
which reads:
filter elements of s that contain any element from subs
I like how this uses words that carry a strong semantic meaning, rather than only if, for, in
