Python 3 split() - python

When I'm splitting a string "abac" I'm getting undesired results.
Example
print("abac".split("a"))
Why does it print:
['', 'b', 'c']
instead of
['b', 'c']
Can anyone explain this behavior and guide me on how to get my desired output?
Thanks in advance.

As @DeepSpace pointed out (referring to the docs):
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']).
Therefore I'd suggest using a different delimiter, such as a comma, or, if this is the formatting you're stuck with, you could use the built-in filter() function as suggested in this answer. Passing None as the function removes any empty strings. In Python 3, filter() returns an iterator, so wrap it in list() to see the result:
sample = 'abac'
filtered_sample = list(filter(None, sample.split('a')))
print(filtered_sample)
# ['b', 'c']
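A minimal sketch of why the leading empty string appears and how to drop it in Python 3. The variable names are illustrative, not from the question.

```python
sample = "abac"

# Splitting on "a" keeps the (empty) text that precedes the first "a".
parts = sample.split("a")
print(parts)  # ['', 'b', 'c']

# filter(None, ...) drops falsy items; list() materialises the lazy iterator.
non_empty = list(filter(None, parts))
print(non_empty)  # ['b', 'c']
```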

When you split a string in Python, you keep everything between your delimiters (even when it's an empty string!)
For example, if you had a list of letters separated by commas:
>>> "a,b,c,d".split(',')
['a','b','c','d']
If your list had some missing values you might leave the space in between the commas blank:
>>> "a,b,,d".split(',')
['a','b','','d']
The start and end of the string act as delimiters themselves, so if you have a leading or trailing delimiter you will also get this "empty string" sliced out of your main string:
>>> "a,b,c,d,,".split(',')
['a','b','c','d','','']
>>> ",a,b,c,d".split(',')
['','a','b','c','d']
If you want to get rid of any empty strings in your output, you can use the filter function.
If instead you just want to get rid of this behavior near the edges of your main string, you can strip the delimiters off first:
>>> ",,a,b,c,d".strip(',')
"a,b,c,d"
>>> ",,a,b,c,d".strip(',').split(',')
['a','b','c','d']
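The two options above behave differently when there are empty fields in the middle of the string. A short sketch, using a made-up input, to show the difference:

```python
text = ",,a,b,,d,"

# strip() only trims delimiters at the edges, so the interior empty field survives.
edges_only = text.strip(",").split(",")
print(edges_only)  # ['a', 'b', '', 'd']

# Filtering removes every empty string, wherever it occurs.
no_empties = [piece for piece in text.split(",") if piece]
print(no_empties)  # ['a', 'b', 'd']
```

Which one is right depends on whether an empty field in the middle carries meaning (for example, a missing value) in your data.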

In your example, "a" is what's called a delimiter. It acts as a boundary between the characters before it and after it. So, when you call split, it gets the characters before "a" and after "a" and inserts it into the list. Since there's nothing in front of the first "a" in the string "abac", it returns an empty string and inserts it into the list.

split will return the characters between the delimiters you specify (or between an end of the string and a delimiter), even if there aren't any, in which case it will return an empty string. (See the documentation for more information.)
In this case, if you don't want any empty strings in the output, you can use filter to remove them:
list(filter(lambda s: len(s) > 0, "abac".split("a")))

Related

Python: list strip overkill

I just want to remove the '.SI' from each item in the list, but strip() overdoes it and also removes any S or I next to it.
ab = ['abc.SI','SIV.SI','ggS.SI']
[x.strip('.SI') for x in ab]
>> ['abc','V','gg']
output which I want is
>> ['abc','SIV','ggS']
Is there an elegant way to do it? I'd prefer not to use a for loop, as my list is long.
Why strip? You can use .replace():
[x.replace('.SI', '') for x in ab]
Output:
['abc', 'SIV', 'ggS']
(this will remove .SI anywhere, have a look at other answers if you want to remove it only at the end)
The reason strip() doesn't work is explained in the docs:
The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped
So it will strip any character in the string that you pass as an argument.
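To see that the argument is treated as a set of characters rather than as a suffix, here is a quick sketch using the question's data:

```python
ab = ["abc.SI", "SIV.SI", "ggS.SI"]

# strip('.SI') removes any run of '.', 'S' or 'I' from both ends,
# not the literal substring '.SI'.
stripped = [x.strip(".SI") for x in ab]
print(stripped)  # ['abc', 'V', 'gg']

# The order of the characters in the argument makes no difference.
same = [x.strip("IS.") for x in ab]
print(same == stripped)  # True
```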
If you want to remove the substring only from the end, the correct way to achieve this will be:
>>> ab = ['abc.SI','SIV.SI','ggS.SI']
>>> sub_string = '.SI'
>>> # endswith() checks that the substring is present at the end before slicing it off
>>> [s[:-len(sub_string)] if s.endswith(sub_string) else s for s in ab]
['abc', 'SIV', 'ggS']
Because str.replace() (as mentioned in TrakJohnson's answer) removes the substring even if it is within the middle of string. For example:
>>> 'ab.SIrt'.replace('.SI', '')
'abrt'
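Use this: [x[:-3] for x in ab]. Note that it removes the last three characters unconditionally, so it assumes every item ends with '.SI'.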
Use split instead of strip and get the first element:
[x.split('.SI')[0] for x in ab]
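The suggestions above differ on inputs that do not end in '.SI' or that contain it in the middle. A small comparison sketch follows; strip_suffix is a hypothetical helper name, and the extra list items are made up for illustration. On Python 3.9 and later, str.removesuffix() does the same job as the helper.

```python
def strip_suffix(text, suffix):
    """Return text without suffix if it ends with it, otherwise unchanged."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

names = ["abc.SI", "SIV.SI", "ggS.SI", "ab.SIrt", "plain"]

by_suffix = [strip_suffix(n, ".SI") for n in names]
by_slice = [n[:-3] for n in names]               # always drops three characters
by_split = [n.split(".SI")[0] for n in names]    # cuts at the first '.SI'
by_replace = [n.replace(".SI", "") for n in names]

print(by_suffix)   # ['abc', 'SIV', 'ggS', 'ab.SIrt', 'plain']
print(by_slice)    # ['abc', 'SIV', 'ggS', 'ab.S', 'pl']
print(by_split)    # ['abc', 'SIV', 'ggS', 'ab', 'plain']
print(by_replace)  # ['abc', 'SIV', 'ggS', 'abrt', 'plain']
```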

How to remove falsy values when splitting a string with a non-whitespace separator

According to the docs:
str.split(sep=None, maxsplit=-1)
If sep is given, consecutive delimiters are not grouped together and
are deemed to delimit empty strings (for example, '1,,2'.split(',')
returns ['1', '', '2']). The sep argument may consist of multiple
characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']).
Splitting an empty string with a specified separator returns
[''].
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
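The two algorithms described in the documentation can be compared side by side. A minimal sketch with a made-up string:

```python
s = "  two  spaces "

# No separator: runs of whitespace count as one separator, edges are ignored.
print(s.split())     # ['two', 'spaces']

# Explicit separator: every single space delimits a (possibly empty) field.
print(s.split(" "))  # ['', '', 'two', '', 'spaces', '']

# The empty-string edge cases mentioned in the docs.
print("".split())     # []
print("".split(","))  # ['']
```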
So to use the keyword argument sep=, is the following the pythonic way to remove the falsy items?
[w for w in s.strip().split(' ') if w]
If it's just whitespaces (\s\t\n), str.split() will suffice but let's say we are trying to split on another character/substring, the if-condition in the list comprehension is necessary. Is that right?
If you want to be obtuse, you could use filter(None, x) to remove falsey items:
>>> list(filter(None, '1,2,,3,'.split(',')))
['1', '2', '3']
Probably less Pythonic. It might be clearer to iterate over the items specifically:
for w in '1,2,,3,'.split(','):
    if w:
        …
This makes it clear that you're skipping the empty items and not relying on the fact that str.split sometimes skips empty items.
I'd just as soon use a regex, either to skip consecutive runs of the separator (but watch out for the end):
>>> re.split(r',+', '1,2,,3,')
['1', '2', '3', '']
or to find everything that's not a separator:
>>> re.findall(r'[^,]+', '1,2,,3,')
['1', '2', '3']
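For reference, the regex variants above as a self-contained snippet, including one way to deal with the trailing separator (stripping it first is one option among several, not the only one):

```python
import re

data = "1,2,,3,"

# Runs of commas collapse into one separator, but the trailing comma
# still leaves an empty string at the end.
print(re.split(r",+", data))             # ['1', '2', '3', '']

# Stripping the separator from the edges first avoids that.
print(re.split(r",+", data.strip(",")))  # ['1', '2', '3']

# Matching the fields instead of the separators never yields empty strings.
print(re.findall(r"[^,]+", data))        # ['1', '2', '3']
```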
If you want to go way back in Python's history, there were two separate functions, split and splitfields. I think the name explains the purpose. The first splits on any whitespace, useful for arbitrary text input, and the second behaves predictably on some delimited input. They were implemented in pure Python before v1.6.
Well, I think you might just need a hand in understanding the documentation. Your example demonstrates the difference between the two algorithms the documentation describes. Not using the sep keyword argument is more or less like splitting on whitespace and then throwing out the empty strings. With sep=' ', every single space is a separator, so two spaces in a row delimit an empty string, and that empty string is kept in the result. Either way the function returns a list of strings, so its return type stays consistent.
The session below shows how a string of 4 spaces is treated differently:
>>> empty = '    '
>>> s = 'this is an irritating string with random spacing .'
>>> empty.split()
[]
>>> empty.split(' ')
['', '', '', '', '']
For your question, just use split() with no sep argument.
Well, your string
s = 'this is an irritating string with random spacing .'
contains runs of more than one whitespace character, which is why s.split(' ') returns empty strings.
Either remove the extra whitespace from s first, or call split() without arguments, to get the desired result.

Last element in python list, created by splitting a string is empty

So I have a string which I need to parse. The string contains a number of words, separated by a hyphen (-). The string also ends with a hyphen.
For example one-two-three-.
Now, if I want to look at the words on their own, I split up the string to a list.
wordstring = "one-two-three-"
wordlist = wordstring.split('-')
for i in range(0, len(wordlist)):
    print(wordlist[i])
Output
one
two
three
#empty element
What I don't understand is, why in the resulting list, the final element is an empty string.
How can I omit this empty element?
Should I simply truncate the list or is there a better way to split the string?
You have an empty string because the split on the last - character produces an empty string on the RHS. You can strip all '-' characters from the string before splitting:
wordlist = wordstring.strip('-').split('-')
If the final element is always a - character, you can omit it by using [:-1] which grabs all the elements of the string besides the last character.
Then, proceed to split it as you did:
wordlist = wordstring[:-1].split('-')
print(wordlist)
['one', 'two', 'three']
You can use regex to do this :
import re
wordlist = re.findall("[a-zA-Z]+(?=-)", wordstring)
Output :
['one', 'two', 'three']
You should use Python's str.strip() method before splitting your string, e.g.:
wordstring = "one-two-three-"
wordlist = wordstring.strip('-').split('-')
I believe .split() is assuming there is another element after the last - but it is obviously a blank entry.
Are you open to removing the dash in wordstring before splitting it?
wordstring = "one-two-three-"
wordlist = wordstring[:-1].split('-')
print(wordlist)
OUT: ['one', 'two', 'three']
This is explained in the docs:
...
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']).
...
If you know your strings will always end in '-', then just remove the last one by doing wordlist.pop().
If you need something more complicated you may want to learn about regular expressions.
Just for the variety of options:
wordlist = [x for x in wordstring.split('-') if x]
Note that the above also handles cases such as: wordstring = "one-two--three-" (double hyphen)
First strip() then split()
wordstring = "one-two-three-"
x = wordstring.strip('-')
y = x.split('-')
for word in y:
    print(word)
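Strip/trim the string before splitting, passing the delimiter: wordstring.strip('-'). This way you will remove the trailing "-" and you should be fine (a bare strip() only removes whitespace such as a trailing "\n").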
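The approaches suggested in these answers are not equivalent once the input contains a double hyphen. A short comparison sketch, using the double-hyphen variant mentioned above:

```python
wordstring = "one-two--three-"

print(wordstring.split("-"))             # ['one', 'two', '', 'three', '']

# Trimming the edge (either way) keeps the empty item in the middle.
print(wordstring.strip("-").split("-"))  # ['one', 'two', '', 'three']
print(wordstring[:-1].split("-"))        # ['one', 'two', '', 'three']

# Filtering drops every empty item.
wordlist = [w for w in wordstring.split("-") if w]
print(wordlist)                          # ['one', 'two', 'three']
```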

Find Certain String Indices

I have this string and I need to get a specific number out of it.
E.G. encrypted = "10134585588147, 3847183463814, 18517461398"
How would I pull out only the second integer out of the string?
You are looking for the "split" method. Turn a string into a list by specifying a smaller part of the string on which to split.
>>> encrypted = '10134585588147, 3847183463814, 18517461398'
>>> encrypted_list = encrypted.split(', ')
>>> encrypted_list
['10134585588147', '3847183463814', '18517461398']
>>> encrypted_list[1]
'3847183463814'
>>> encrypted_list[-1]
'18517461398'
Then you can just access the indices as normal. Note that lists can be indexed forwards or backwards. By providing a negative index, we count from the right rather than the left, selecting the last index (without any idea how big the list is). Note this will produce IndexError if the list is empty, though. If you use Jon's method (below), there will always be at least one index in the list unless the string you start with is itself empty.
Edited to add:
What Jon is pointing out in the comment is that if you are not sure whether the string will be well-formatted (e.g., always separated by exactly one comma followed by exactly one space), then you can replace all the commas with spaces (encrypted.replace(',', ' ')), then call split without arguments, which will split on any number of whitespace characters. As usual, you can chain these together:
encrypted.replace(',', ' ').split()
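A runnable sketch of the chained version on a deliberately messy input. The irregular spacing and the conversion to int are assumptions added for illustration; the question only asks for the second value.

```python
# Irregular spacing around the commas (made up for this example).
encrypted = "10134585588147,3847183463814 ,  18517461398"

# Turn commas into spaces, then split on any run of whitespace.
tokens = encrypted.replace(",", " ").split()
print(tokens)      # ['10134585588147', '3847183463814', '18517461398']

# Convert to integers if numeric values are needed.
numbers = [int(token) for token in tokens]
print(numbers[1])  # 3847183463814
```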

Python - regex, blank element at the end of the list?

I have a code
print(re.split(r"[\s\?\!\,\;]+", "Holy moly, feferoni!"))
which results
['Holy', 'moly', 'feferoni', '']
How can I get rid of this last blank element, what caused it?
If this is a dirty way to get rid of punctuation and spaces from a string, how else can I write but in regex?
Expanding on what @HamZa said in his comment, you would use re.findall and a negative character set:
>>> from re import findall
>>> findall(r"[^\s?!,;]+", "Holy moly, feferoni!")
['Holy', 'moly', 'feferoni']
>>>
You get the empty string as the last element of your list because the RegEx splits after the last !. It ends up giving you what's before the ! and what's after it, but after it there's simply nothing, i.e. an empty string! You might have the same problem in the middle of the string if you hadn't wisely added the + to your RegEx.
Add a call to list if you can't work with an iterator. If you want to elegantly get rid of the optional empty string, do:
filter(None, re.split(r"[\s?!,;]+", "Holy moly, feferoni!"))
This will result in:
['Holy', 'moly', 'feferoni']
What this does is remove every element that is not truthy. The filter function generally only returns elements that satisfy a requirement given as a function, but if you pass None it checks the truthiness of the value itself. Because an empty string is falsy and every other string is truthy, it will remove every empty string from the list.
Also note I removed the escaping of special characters in the character class, as it is simply not necessary and just makes the RegEx harder to read.
The first thing that comes to my mind is something like this:
>>> mystring = re.split(r"[\s\?\!\,\;]+", "Holy moly, feferoni!")
>>> mystring
['Holy', 'moly', 'feferoni', '']
>>> mystring.pop()
''
>>> print(mystring)
['Holy', 'moly', 'feferoni']
__import__('re').findall(r'[^\s?!,;]+', 'Holy moly, feferoni!')
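Putting the answers together, a minimal self-contained sketch of where the trailing empty string comes from and two ways to avoid it (the variable names are illustrative):

```python
import re

text = "Holy moly, feferoni!"
separators = r"[\s?!,;]+"

# The final "!" matches the pattern, so split() reports the (empty)
# text that follows it.
print(re.split(separators, text))  # ['Holy', 'moly', 'feferoni', '']

# Option 1: drop the empty pieces after splitting.
words = [w for w in re.split(separators, text) if w]
print(words)                       # ['Holy', 'moly', 'feferoni']

# Option 2: match the words directly with the negated character class.
print(re.findall(r"[^\s?!,;]+", text))  # ['Holy', 'moly', 'feferoni']
```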
