Trailing empty string after re.split()

I have two strings where I want to isolate sequences of digits from everything else.
For example:
import re
s = 'abc123abc'
print(re.split(r'(\d+)', s))
s = 'abc123abc123'
print(re.split(r'(\d+)', s))
The output looks like this:
['abc', '123', 'abc']
['abc', '123', 'abc', '123', '']
Note that in the second case, there's a trailing empty string.
Obviously I can test for that and remove it if necessary but it seems cumbersome and I wondered if the RE can be improved to account for this scenario.

You can use filter to drop the empty string:
>>> s = 'abc123abc123'
>>> re.split(r'(\d+)', s)
['abc', '123', 'abc', '123', '']
>>> list(filter(None, re.split(r'(\d+)', s)))
['abc', '123', 'abc', '123']
As chepner suggested, you can also use a list comprehension:
>>> [x for x in re.split(r'(\d+)', s) if x]
['abc', '123', 'abc', '123']
This also works when the string contains symbols or other non-digit characters:
>>> s = '&^%123abc123$##123'
>>> list(filter(None, re.split(r'(\d+)', s)))
['&^%', '123', 'abc', '123', '$##', '123']

This has to do with the implementation of re.split() itself: you can't change it. When the function splits, it doesn't check anything that comes after the capture group, so it can't choose for you to either keep or discard the empty string that is left after splitting. It just splits there and leaves the rest of the string (which can be empty) to the next cycle.
If you don't want that empty string, you can get rid of it in various ways before collecting the results into a list. user1740577's is one example, but personally I prefer a list comprehension, since it's more idiomatic for simple filter/map operations:
parts = [part for part in re.split(r'(\d+)', s) if part]
I recommend against checking and getting rid of the element after the list has already been created, because it involves more operations and allocations.
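To make the behaviour concrete, here is a minimal sketch (the helper name split_digits is made up for illustration). re.split() yields an empty string whenever the separator matches at the very start or very end of the input, and filtering while the list is built removes it:

```python
import re

def split_digits(s):
    """Hypothetical helper: split s into digit and non-digit runs, dropping empty pieces."""
    return [part for part in re.split(r'(\d+)', s) if part]

# An empty string appears wherever the separator touches an edge of the input.
leading = re.split(r'(\d+)', '123abc')    # ['', '123', 'abc']
trailing = re.split(r'(\d+)', 'abc123')   # ['abc', '123', '']

print(leading, trailing)
print(split_digits('abc123abc123'))       # ['abc', '123', 'abc', '123']
```

The same filter handles a leading empty string, so no special case is needed for either end.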

A simple way to use regular expressions for this would be re.findall:
import re
def bits(s):
    return re.findall(r"(\D+|\d+)", s)
bits("abc123abc123")
# ['abc', '123', 'abc', '123']
But it seems easier and more natural with itertools.groupby. After all, you are chunking an iterable based on a single condition:
from itertools import groupby
def bits(s):
    return ["".join(g) for _, g in groupby(s, key=str.isdigit)]
bits("abc123abc123")
# ['abc', '123', 'abc', '123']
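If it helps to see what groupby is doing, this small sketch keeps the key next to each chunk. The variable names are illustrative only; the key is True for a run of digits and False for anything else, which also makes it easy to keep just one kind of chunk:

```python
from itertools import groupby

s = "abc123abc123"

# Each group is a run of consecutive characters sharing the same key value.
runs = [(is_digit, "".join(group)) for is_digit, group in groupby(s, key=str.isdigit)]
print(runs)
# [(False, 'abc'), (True, '123'), (False, 'abc'), (True, '123')]

# Keeping the key makes it simple to select only the digit runs.
digit_runs = [text for is_digit, text in runs if is_digit]
print(digit_runs)
# ['123', '123']
```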

Related

Search a list of strings with a list of substrings

I have a list of strings and currently I can search for one substring at the time:
str = ['abc', 'efg', 'xyz']
[s for s in str if "a" in s]
which correctly returns
['abc']
Now let's say I have a list of substrings instead:
subs = ['a', 'ef']
I want a command like
[s for s in str if anyof(subs) in s]
which should return
['abc', 'efg']

>>> s = ['abc', 'efg', 'xyz']
>>> subs = ['a', 'ef']
>>> [x for x in s if any(sub in x for sub in subs)]
['abc', 'efg']
Don't use str as a variable name; it shadows the builtin.

Gets a little convoluted but you could do
[s for s in str if any([sub for sub in subs if sub in s])]

Simply use them one after the other:
[s for s in str for r in subs if r in s]
>>> r = ['abc', 'efg', 'xyz']
>>> s = ['a', 'ef']
>>> [t for t in r for x in s if x in t]
['abc', 'efg']
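One caveat with the nested-loop form, shown as a small sketch with made-up sample data: a string that contains more than one of the substrings is emitted once per match, whereas the any() version emits it once.

```python
words = ['abc', 'efg', 'xyz']
subs = ['a', 'b']  # both substrings occur in 'abc'

# Nested loops: one output item per (word, substring) match.
nested = [w for w in words for sub in subs if sub in w]

# any(): one output item per word that matches at least one substring.
with_any = [w for w in words if any(sub in w for sub in subs)]

print(nested)    # ['abc', 'abc']
print(with_any)  # ['abc']
```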

I still like map and filter, despite what is said against them and the fact that a comprehension can always replace them. Hence, here is a map + filter + lambda version:
print(list(filter(lambda x: any(map(x.__contains__, subs)), s)))
which reads:
filter elements of s that contain any element from subs
I like how this uses words that carry a strong semantic meaning, rather than only if, for, and in.

How to split a string containing digits and characters

I want to split a long string (containing digits and letters, without any spaces) into separate substrings in Python.
>>> s = "abc123cde4567"
After splitting, I want to get:
['abc', '123', 'cde', '4567']
Thank you!

>>> import re
>>> re.findall("[a-z]+|[0-9]+", "abc123cde4567")
['abc', '123', 'cde', '4567']

Something different from a regex:
from itertools import groupby
from string import digits
s = "abc123cde4567"
print([''.join(g) for k, g in groupby(s, digits.__contains__)])
# ['abc', '123', 'cde', '4567']

Splitting a string into a list (but not separating adjacent numbers) in Python

For example, I have:
string = "123ab4 5"
I want to be able to get the following list:
["123","ab","4","5"]
rather than list(string) giving me:
["1","2","3","a","b","4"," ","5"]

Find one or more adjacent digits (\d+), or if that fails find non-digit, non-space characters ([^\d\s]+).
>>> string = '123ab4 5'
>>> import re
>>> re.findall(r'\d+|[^\d\s]+', string)
['123', 'ab', '4', '5']
If you don't want the letters joined together, try this:
>>> re.findall(r'\d+|\S', string)
['123', 'a', 'b', '4', '5']

The other solutions are definitely easier. If you want something far less straightforward, you could try something like this:
>>> import string
>>> from itertools import groupby
>>> s = "123ab4 5"
>>> result = [''.join(list(v)) for _, v in groupby(s, key=lambda x: x.isdigit())]
>>> result = [x for x in result if x not in string.whitespace]
>>> result
['123', 'ab', '4', '5']

You could do:
>>> [el for el in re.split(r'(\d+)', string) if el.strip()]
['123', 'ab', '4', '5']

This will give the split you want:
re.findall(r'\d+|[a-zA-Z]+', "123ab4 5")
['123', 'ab', '4', '5']

You can do a few things here:
1. Iterate over the characters and build groups of digits as you go, appending them to your results list. Not a great solution.
2. Use regular expressions.
Implementation of 2:
>>> import re
>>> s = "123ab4 5"
>>> re.findall(r'\d+|[^\d]', s)
['123', 'a', 'b', '4', ' ', '5']
This grabs any group of one or more digits (\d+) or any other single character.
Edit:
John beat me to the correct solution, and it's a wonderful one.
I will leave this here, though, because someone else might misread the question the way I did and come looking for this answer. I was under the impression the OP wanted to capture only groups of numbers and leave everything else as individual characters.

Splitting strings in python based on index

This sounds pretty basic, but I can't think of a neat, straightforward method to do this in Python yet.
I have a string like "abcdefgh" and I need to create a list of elements picking two characters at a time from the string to get ['ab','cd','ef','gh'].
What I am doing right now is this
output = []
for i in range(0,len(input),2):
    output.append(input[i:i+2])
Is there a nicer way?

In [2]: s = 'abcdefgh'
In [3]: [s[i:i+2] for i in range(0, len(s), 2)]
Out[3]: ['ab', 'cd', 'ef', 'gh']

Just for the fun of it, if you hate for
>>> s='abcdefgh'
>>> list(map(''.join, zip(s[::2], s[1::2])))
['ab', 'cd', 'ef', 'gh']
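A quick check worth doing before relying on the zip trick, sketched here with an odd-length sample string of my own choosing: zip stops at the shorter input, so the last character is silently dropped, while the slicing version keeps it as a short final chunk.

```python
s = 'abcdefg'  # odd length: 7 characters

# zip pairs 'aceg' with 'bdf' and stops after three pairs, losing 'g'.
pairs_zip = list(map(''.join, zip(s[::2], s[1::2])))

# Slicing keeps the leftover character as a one-character chunk.
pairs_slice = [s[i:i+2] for i in range(0, len(s), 2)]

print(pairs_zip)    # ['ab', 'cd', 'ef']
print(pairs_slice)  # ['ab', 'cd', 'ef', 'g']
```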

Is there a nicer way?
Sure. List comprehension can do that.
def n_chars_at_a_time(s, n=2):
    return [s[i:i+n] for i in range(0, len(s), n)]
should do what you want. The s[i:i+n] returns the substring starting at i and ending n characters later.
n_chars_at_a_time("foo bar baz boo", 2)
produces
['fo', 'o ', 'ba', 'r ', 'ba', 'z ', 'bo', 'o']
in the Python REPL.
For more info see Generator Expressions and List Comprehensions:
Two common operations on an iterator’s output are 1) performing some operation for every element, 2) selecting a subset of elements that meet some condition.
For example, given a list of strings, you might want to strip off trailing whitespace from each line or extract all the strings containing a given substring.
List comprehensions and generator expressions (short form: “listcomps” and “genexps”) are a concise notation for such operations...
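The two operations from that passage can be sketched as follows. The sample list is invented for illustration; the first pair shows the same transformation written as a generator expression and as a list comprehension, and the last line shows the filtering form.

```python
lines = ['line 1 ', 'line 2  \n', 'other']

# Generator expression: lazy, produces stripped items one at a time.
stripped_iter = (line.rstrip() for line in lines)

# List comprehension: eager, builds the whole list immediately.
stripped_list = [line.rstrip() for line in lines]

# Selecting a subset: keep only the strings containing a given substring.
with_line = [line for line in stripped_list if 'line' in line]

print(list(stripped_iter))  # ['line 1', 'line 2', 'other']
print(with_line)            # ['line 1', 'line 2']
```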

Split by comma and strip whitespace in Python

I have some python code that splits on comma, but doesn't strip the whitespace:
>>> string = "blah, lots , of , spaces, here "
>>> mylist = string.split(',')
>>> print(mylist)
['blah', ' lots ', ' of ', ' spaces', ' here ']
I would rather end up with whitespace removed like this:
['blah', 'lots', 'of', 'spaces', 'here']
I am aware that I could loop through the list and strip() each item but, as this is Python, I'm guessing there's a quicker, easier and more elegant way of doing it.

Use list comprehension -- simpler, and just as easy to read as a for loop.
my_string = "blah, lots , of , spaces, here "
result = [x.strip() for x in my_string.split(',')]
# result is ["blah", "lots", "of", "spaces", "here"]
See: Python docs on List Comprehension
A good 2 second explanation of list comprehension.

I came to add:
map(str.strip, string.split(','))
but saw it had already been mentioned by Jason Orendorff in a comment.
Reading Glenn Maynard's comment on the same answer suggesting list comprehensions over map I started to wonder why. I assumed he meant for performance reasons, but of course he might have meant for stylistic reasons, or something else (Glenn?).
So a quick (possibly flawed?) test on my box (Python 2.6.5 on Ubuntu 10.04) applying the three methods in a loop revealed:
$ time ./list_comprehension.py # [word.strip() for word in string.split(',')]
real 0m22.876s
$ time ./map_with_lambda.py # map(lambda s: s.strip(), string.split(','))
real 0m25.736s
$ time ./map_with_str.strip.py # map(str.strip, string.split(','))
real 0m19.428s
making map(str.strip, string.split(',')) the winner, although it seems they are all in the same ballpark.
Certainly though map (with or without a lambda) should not necessarily be ruled out for performance reasons, and for me it is at least as clear as a list comprehension.
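For anyone who wants to repeat the comparison on a current interpreter, here is a rough timeit sketch. It assumes Python 3, where map() is lazy and so is wrapped in list() to do the same work as the comprehension. The function names and the iteration count are illustrative choices, not the original scripts, and timings vary by machine, so none are shown.

```python
import timeit

string = "blah, lots , of , spaces, here "

def with_listcomp():
    return [word.strip() for word in string.split(',')]

def with_map_lambda():
    # list() forces evaluation; map() alone would return a lazy iterator.
    return list(map(lambda s: s.strip(), string.split(',')))

def with_map_method():
    return list(map(str.strip, string.split(',')))

# Time each variant over the same number of calls and print seconds elapsed.
for fn in (with_listcomp, with_map_lambda, with_map_method):
    elapsed = timeit.timeit(fn, number=100_000)
    print(f"{fn.__name__}: {elapsed:.3f}s")
```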

Split using a regular expression. Note I made the case more general with leading spaces. The list comprehension is to remove the null strings at the front and back.
>>> import re
>>> string = " blah, lots , of , spaces, here "
>>> pattern = re.compile(r"^\s+|\s*,\s*|\s+$")
>>> print([x for x in pattern.split(string) if x])
['blah', 'lots', 'of', 'spaces', 'here']
This works even if ^\s+ doesn't match:
>>> string = "foo, bar "
>>> print([x for x in pattern.split(string) if x])
['foo', 'bar']
Here's why you need ^\s+:
>>> string = " blah, lots , of , spaces, here "
>>> pattern = re.compile(r"\s*,\s*|\s+$")
>>> print([x for x in pattern.split(string) if x])
[' blah', 'lots', 'of', 'spaces', 'here']
See the leading space in blah?
Clarification: above uses the Python 3 interpreter, but results are the same in Python 2.

Just remove the white space from the string before you split it.
mylist = my_string.replace(' ','').split(',')

I know this has already been answered, but if you end up doing this a lot, regular expressions may be a better way to go:
>>> import re
>>> re.sub(r'\s', '', string).split(',')
['blah', 'lots', 'of', 'spaces', 'here']
The \s matches any whitespace character, and we just replace it with an empty string ''. You can find more info here: http://docs.python.org/library/re.html#re.sub
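One thing to keep in mind, sketched below with a made-up input: because every whitespace character is removed, spaces inside an item disappear too. If items can contain internal spaces, stripping each piece after the split preserves them.

```python
import re

string = "New York , Los Angeles,Boston "

# Removing all whitespace first also joins the words inside each item.
removed = re.sub(r'\s', '', string).split(',')

# Stripping after the split only trims the edges of each item.
stripped = [item.strip() for item in string.split(',')]

print(removed)   # ['NewYork', 'LosAngeles', 'Boston']
print(stripped)  # ['New York', 'Los Angeles', 'Boston']
```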

map(lambda s: s.strip(), mylist) would be a little better than explicitly looping. Or for the whole thing at once: map(lambda s:s.strip(), string.split(','))

import re
result = [x for x in re.split(',| ', your_string) if x != '']
This works fine for me.

re (as in regular expressions) allows splitting on multiple characters at once:
>>> import re
>>> string = "blah, lots  ,  of ,  spaces, here "
>>> re.split(', ', string)
['blah', 'lots  ', ' of ', ' spaces', 'here ']
This doesn't work well for your example string, but works nicely for a comma-space separated list. For your example string, you can give re.split a character class to get a "split-on-this-or-that" effect.
>>> re.split('[, ]', string)
['blah', '', 'lots', '', '', '', '', 'of', '', '', '', 'spaces', '', 'here', '']
Unfortunately, that's ugly, but a filter will do the trick:
>>> list(filter(None, re.split('[, ]', string)))
['blah', 'lots', 'of', 'spaces', 'here']
Voila!

s = 'bla, buu, jii'
sp = s.split(',')
for st in sp:
    print(st)

import re
mylist = [x for x in re.compile(r'\s*(?:,|\s+)\s*').split(string) if x]
Simply put: split on a comma or on at least one whitespace character, with or without surrounding whitespace, and drop any empty strings left at the ends.
Please try it!
