python how to split string with more than one character? - python

I would like to split a string as shown below:
1234ABC into 1234 and ABC
2B into 2 and B
10E into 10 and E
I found that the split function does not work because there is no delimiter.

You can use itertools.groupby with str.isdigit as the key function:
from itertools import groupby
test1 = '123ABC'
test2 = '2B'
test3 = '10E'
def custom_split(s):
    return [''.join(gp) for _, gp in groupby(s, lambda char: char.isdigit())]
for t in [test1, test2, test3]:
    print(custom_split(t))
# ['123', 'ABC']
# ['2', 'B']
# ['10', 'E']

This can quite easily be accomplished using the re module:
>>> import re
>>>
>>> re.findall('[a-zA-Z]+|[0-9]+', '1234ABC')
['1234', 'ABC']
>>> re.findall('[a-zA-Z]+|[0-9]+', '2B')
['2', 'B']
>>> re.findall('[a-zA-Z]+|[0-9]+', '10E')
['10', 'E']
>>> # additional test case
...
>>> re.findall('[a-zA-Z]+|[0-9]+', 'abcd1234efgh5678')
['abcd', '1234', 'efgh', '5678']
>>>
The regex is very simple. Here is a quick walkthrough:
[a-zA-Z]+: match one or more alphabetic characters, lower or upper case
|: or...
[0-9]+: match one or more digits

Another way to solve it, using re.search with two capture groups:
import re
test_string = '1234ABC'
r = re.search('([0-9]*)([a-zA-Z]*)', test_string)
r.groups()  # ('1234', 'ABC')
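One caveat with the re.search approach: because both groups use *, the pattern can match an empty string at position 0, so an input such as 'abcd1234' yields ('', 'abcd') rather than failing. A minimal sketch of a stricter variant follows, assuming the input is always one run of digits followed by one run of letters. The function name split_number_unit is hypothetical and not part of the original answer.

```python
import re

def split_number_unit(s):
    """Split a string like '10E' into its digit part and its letter part."""
    # fullmatch requires the whole string to fit the pattern, and + requires
    # at least one character in each group.
    m = re.fullmatch(r'([0-9]+)([a-zA-Z]+)', s)
    if m is None:
        raise ValueError(f"expected digits followed by letters, got {s!r}")
    return m.groups()

for t in ['1234ABC', '2B', '10E']:
    print(split_number_unit(t))
# ('1234', 'ABC')
# ('2', 'B')
# ('10', 'E')
```

Raising on unexpected input is a design choice here. Returning None instead would work equally well if the caller prefers to check the result.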

Related

Trailing empty string after re.split()

I have two strings where I want to isolate sequences of digits from everything else.
For example:
import re
s = 'abc123abc'
print(re.split(r'(\d+)', s))
s = 'abc123abc123'
print(re.split(r'(\d+)', s))
The output looks like this:
['abc', '123', 'abc']
['abc', '123', 'abc', '123', '']
Note that in the second case, there's a trailing empty string.
Obviously I can test for that and remove it if necessary, but that seems cumbersome, and I wondered whether the RE can be improved to account for this scenario.
You can use filter to drop the empty string, as shown below:
>>> s = 'abc123abc123'
>>> re.split(r'(\d+)', s)
['abc', '123', 'abc', '123', '']
>>> list(filter(None, re.split(r'(\d+)', s)))
['abc', '123', 'abc', '123']
Thanks to @chepner, you can also use a list comprehension:
>>> [x for x in re.split(r'(\d+)', s) if x]
['abc', '123', 'abc', '123']
If the string also contains symbols or other characters, the same approach still works:
>>> s = '&^%123abc123$##123'
>>> list(filter(None, re.split(r'(\d+)', s)))
['&^%', '123', 'abc', '123', '$##', '123']
This has to do with the implementation of re.split() itself: you can't change it. When the function splits, it doesn't check anything that comes after the capture group, so it can't choose for you to either keep or discard the empty string that is left after splitting. It just splits there and leaves the rest of the string (which can be empty) to the next cycle.
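To make that behavior concrete, here is a small sketch (the variable names are only for illustration) showing that re.split produces an empty string whenever a match sits at the very start or very end of the input, and none otherwise:

```python
import re

pattern = r'(\d+)'

# A match at the very end leaves an empty remainder after the last split.
trailing = re.split(pattern, 'abc123abc123')
print(trailing)  # ['abc', '123', 'abc', '123', '']

# A match at the very start leaves an empty piece before the first split.
leading = re.split(pattern, '123abc')
print(leading)  # ['', '123', 'abc']

# No match at either edge, so no empty strings appear.
neither = re.split(pattern, 'abc123abc')
print(neither)  # ['abc', '123', 'abc']
```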
If you don't want that empty string, you can get rid of it in various ways before collecting the results into a list. user1740577's is one example, but personally I prefer a list comprehension, since it's more idiomatic for simple filter/map operations:
parts = [part for part in re.split(r'(\d+)', s) if part]
I recommend against checking and getting rid of the element after the list has already been created, because it involves more operations and allocations.
A simple way to use regular expressions for this would be re.findall:
def bits(s):
    return re.findall(r"(\D+|\d+)", s)
bits("abc123abc123")
# ['abc', '123', 'abc', '123']
But it seems easier and more natural with itertools.groupby. After all, you are chunking an iterable based on a single condition:
from itertools import groupby
def bits(s):
    return ["".join(g) for _, g in groupby(s, key=str.isdigit)]
bits("abc123abc123")
# ['abc', '123', 'abc', '123']

Python string to list?

I'm trying to convert string to a list
str = "ab(1234)bcta(45am)in23i(ab78lk)"
Expected Output
res_str = ["ab","bcta","in23i"]
I tried removing brackets from str.
re.sub(r'\([^)]*\)', '', str)
You may use a negated character class with a lookahead:
>>> s = "ab(1234)bcta(45am)in23i(ab78lk)"
>>> print (re.findall(r'[^()]+(?=\()', s))
['ab', 'bcta', 'in23i']
RegEx Details:
[^()]+: Match 1 or more of any character that is not ( or )
(?=\(): Lookahead to assert that there is a ( ahead
So many options here. One possibility would be using split:
import re
str = "ab(1234)bcta(45am)in23i(ab78lk)"
print(re.split(r'\(.*?\)', str)[:-1])
Returns:
['ab', 'bcta', 'in23i']
A second option would be to split on every parenthesis and slice the resulting list:
import re
str = "ab(1234)bcta(45am)in23i(ab78lk)"
print(re.split('[()]', str)[0:-1:2])
Where [0:-1:2] means start at index 0, stop before the last index, and step by two indices.
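The slice is easier to follow when the intermediate list is printed. A short sketch, assuming the parentheses are balanced and not nested (the names pieces and outside are only for illustration):

```python
import re

s = "ab(1234)bcta(45am)in23i(ab78lk)"

# Splitting on either parenthesis alternates outside/inside pieces and,
# because the string ends with ")", leaves a trailing empty string.
pieces = re.split(r'[()]', s)
print(pieces)
# ['ab', '1234', 'bcta', '45am', 'in23i', 'ab78lk', '']

# Even indices hold the text outside the parentheses; the stop value -1
# excludes the final element.
outside = pieces[0:-1:2]
print(outside)
# ['ab', 'bcta', 'in23i']
```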
Use re.split
import re
str = "ab(1234)bcta(45am)in23i(ab78lk)"
print(re.split(r'\(.*?\)', str))
Returns:
['ab', 'bcta', 'in23i', '']
If you want to get rid of empty strings in your list, you may use a filter:
print(list(filter(None, re.split(r'\(.*?\)', str))))
Returns:
['ab', 'bcta', 'in23i']
You may match all alphanumeric characters followed by a (:
>>> re.findall(r'\w+(?=\()', str)
['ab', 'bcta', 'in23i']
or use re.sub as you were:
>>> re.sub(r'\([^)]+\)', ' ', str).split()
['ab', 'bcta', 'in23i']
Just for the sake of complexity:
>>> str = "ab(1234)bcta(45am)in23i(ab78lk)"
>>> res_str = [y[-1] for y in [x.split(')') for x in str.split('(')]][0:-1]
>>> res_str
['ab', 'bcta', 'in23i']

How to convert a string into a list without including a specific character plus without using replace and split methods?

Let's say you have:
x = "1,2,13"
and you want to achieve:
list = ["1","2","13"]
Can you do it without the split and replace methods?
What I have tried:
list=[]
for number in x:
    if number != ",":
        list.append(number)
print(list) # ['1', '2', '1', '3']
but this works only if each number is a single digit.
You could use a regular expression:
>>> import re
>>> re.findall(r'(\d+)', '123,456')
['123', '456']
Here is a way that assumes integers, using itertools:
>>> import itertools
>>> x = "1,88,22"
>>> ["".join(g) for b,g in itertools.groupby(x,str.isdigit) if b]
['1', '88', '22']
>>>
Here is a method that uses traditional looping:
>>> digit = ""
>>> digit_list = []
>>> for c in x:
...     if c.isdigit():
...         digit += c
...     elif c == "," and digit:
...         digit_list.append(digit)
...         digit = ""
...
>>> if digit:  # flush the last number, which has no trailing comma
...     digit_list.append(digit)
...
>>> digit_list
['1', '88', '22']
>>>
In the real world, you'd probably just use regex...

How to split a string containing digits and characters

I want to split a long string (containing digits and letters, without any spaces) into separate substrings in Python.
>>> s = "abc123cde4567"
After the split, I want to get:
['abc', '123', 'cde', '4567']
Thank you!
>>> import re
>>> re.findall("[a-z]+|[0-9]+", "abc123cde4567")
['abc', '123', 'cde', '4567']
Something different from a regex:
from itertools import groupby
from string import digits
s = "abc123cde4567"
print([''.join(g) for k, g in groupby(s, digits.__contains__)])
# ['abc', '123', 'cde', '4567']

Splitting a string into a list (but not separating adjacent numbers) in Python

For example, I have:
string = "123ab4 5"
I want to be able to get the following list:
["123","ab","4","5"]
rather than list(string) giving me:
["1","2","3","a","b","4"," ","5"]
Find one or more adjacent digits (\d+), or if that fails find non-digit, non-space characters ([^\d\s]+).
>>> string = '123ab4 5'
>>> import re
>>> re.findall(r'\d+|[^\d\s]+', string)
['123', 'ab', '4', '5']
If you don't want the letters joined together, try this:
>>> re.findall(r'\d+|\S', string)
['123', 'a', 'b', '4', '5']
The other solutions are definitely easier. If you want something far less straightforward, you could try something like this:
>>> import string
>>> from itertools import groupby
>>> s = "123ab4 5"
>>> result = [''.join(list(v)) for _, v in groupby(s, key=lambda x: x.isdigit())]
>>> result = [x for x in result if x not in string.whitespace]
>>> result
['123', 'ab', '4', '5']
You could do:
>>> [el for el in re.split(r'(\d+)', string) if el.strip()]
['123', 'ab', '4', '5']
This will give the split you want:
re.findall(r'\d+|[a-zA-Z]+', "123ab4 5")
['123', 'ab', '4', '5']
You can do a few things here:
1. Iterate over the string and build groups of digits as you go, appending them to your results list.
Not a great solution.
2. Use regular expressions.
Implementation of 2:
>>> import re
>>> s = "123ab4 5"
>>> re.findall(r'\d+|[^\d]', s)
['123', 'a', 'b', '4', ' ', '5']
You want to grab any group that is at least one digit (\d+), or any other single character.
Edit:
John beat me to the correct solution, and it's a wonderful solution.
I will leave this here, though, because someone else might misread the question the same way and look for an answer to what I thought was asked. I was under the impression the OP wanted to capture only groups of numbers and leave everything else as individual characters.
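For completeness, option 1 from this answer (iterating and grouping digits as you go) might be sketched as follows. This is only an illustration, not code from the original answer. The function name group_digits is hypothetical, and it reproduces the same reading of the question: runs of digits are grouped, every other character stays separate.

```python
def group_digits(s):
    """Group runs of digits; keep every other character as its own item."""
    result = []
    current = ""  # digits collected so far for the current run
    for ch in s:
        if ch.isdigit():
            current += ch
        else:
            if current:
                # A non-digit ends the current run of digits.
                result.append(current)
                current = ""
            result.append(ch)
    if current:  # flush a run that reaches the end of the string
        result.append(current)
    return result

print(group_digits("123ab4 5"))
# ['123', 'a', 'b', '4', ' ', '5']
```

For this input the result matches the re.findall output shown above. The regular expression is shorter, which is why the answer calls the loop "not a great solution".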
