selecting sub-sequence confusion in python [duplicate] - python

Possible Duplicate:
The Python Slice Notation
I am confused by the way Python subsequence selection works.
Suppose I have the following code:
>>> t = 'hi'
>>> t[:3]
'hi'
>>> t[3:]
''
>>> print t[:3] + t[3:]
hi
>>> print t[3]
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print t[3]
IndexError: string index out of range
Please explain how this works in Python.

Subsequence, or slice, notation is forgiving. t[:3] gives you a slice of t from the beginning up to (but not including) index 3, or up to the end if t is shorter than that; t[3:] gives you a slice of t from index 3, if it exists, through the end. Direct indexing such as t[3] is not forgiving: the indexed element must exist, or else you get an exception. With slices, if the end index is out of range, you get everything from the start index onward (here, the whole string); if the start index is out of range, you get an empty sequence.
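To make the contrast concrete, here is a minimal sketch using the same two-character string as the question. It is written for Python 3 (the transcript above is Python 2), and the variable names are only illustrative.

```python
t = 'hi'

# Slicing clamps out-of-range bounds instead of raising.
head = t[:3]   # stop is past the end, so the whole string comes back
tail = t[3:]   # start is past the end, so the result is empty

# Plain indexing has no such leniency.
try:
    t[3]
    index_failed = False
except IndexError:
    index_failed = True

print(repr(head), repr(tail), index_failed)
```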

I have always found it a somewhat odd behavior of sequences that they allow out-of-bounds slicing. However, it is documented, specifically in bullet point 4, which describes slicing of a sequence type:
The slice of s from i to j is defined as the sequence of items with index k such that i <= k < j. If i or j is greater than len(s), use len(s). If i is omitted or None, use 0. If j is omitted or None, use len(s). If i is greater than or equal to j, the slice is empty.
or bullet point 5 which describes slicing with the optional stride parameter:
The slice of s from i to j with step k is defined as the sequence of items with index x = i + n*k such that 0 <= n < (j-i)/k. In other words, the indices are i, i+k, i+2*k, i+3*k and so on, stopping when j is reached (but never including j). If i or j is greater than len(s), use len(s). If i or j are omitted or None, they become “end” values (which end depends on the sign of k). Note, k cannot be zero. If k is None, it is treated like 1
Note that if you look at point 3 (which describes s[index]), there is no corresponding transformation of out-of-bounds indices into in-bounds indices.
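One way to watch this clamping happen is the built-in slice.indices(length) method, which reports the (start, stop, step) triple a slice resolves to for a sequence of a given length. The sketch below assumes Python 3 and reuses the two-character string from the question; the variable names are illustrative.

```python
s = 'hi'

# slice.indices(length) returns the (start, stop, step) actually used
# for a sequence of that length, after clamping and filling in defaults.
clamped_stop = slice(None, 3).indices(len(s))     # s[:3]
clamped_start = slice(3, None).indices(len(s))    # s[3:]
far_out = slice(999, 9999).indices(len(s))        # s[999:9999]

print(clamped_stop, clamped_start, far_out)
```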

t[start:stop] returns all elements whose index x satisfies start <= x < stop. Indices in that range that do not exist are simply skipped.
t[index], on the other hand, raises an error if there is no element at the given index.
In your example only t[0] = 'h' and t[1] = 'i' exist, which explains your results.
t[3:] should give the empty string rather than 'hi', and that is also what my Python interpreter returns.

Related

Why does out of range numpy list slicing not raise IndexError [duplicate]

Why doesn't 'example'[999:9999] result in error? Since 'example'[9] does, what is the motivation behind it?
From this behavior I can assume that 'example'[3] is, essentially/internally, not the same as 'example'[3:4], even though both result in the same 'm' string.
You're correct! 'example'[3:4] and 'example'[3] are fundamentally different, and slicing outside the bounds of a sequence (at least for built-ins) doesn't cause an error.
It might be surprising at first, but it makes sense when you think about it. Indexing returns a single item, but slicing returns a subsequence of items. So when you try to index a nonexistent value, there's nothing to return. But when you slice a sequence outside of bounds, you can still return an empty sequence.
Part of what's confusing here is that strings behave a little differently from lists. Look what happens when you do the same thing to a list:
>>> [0, 1, 2, 3, 4, 5][3]
3
>>> [0, 1, 2, 3, 4, 5][3:4]
[3]
Here the difference is obvious. In the case of strings, the results appear to be identical because in Python, there's no such thing as an individual character outside of a string. A single character is just a 1-character string.
(For the exact semantics of slicing outside the range of a sequence, see mgilson's answer.)
For the sake of adding an answer that points to a robust section in the documentation:
Given a slice expression like s[i:j:k],
The slice of s from i to j with step k is defined as the sequence of items with index x = i + n*k such that 0 <= n < (j-i)/k. In other words, the indices are i, i+k, i+2*k, i+3*k and so on, stopping when j is reached (but never including j). When k is positive, i and j are reduced to len(s) if they are greater
If you write s[999:9999], Python returns s[len(s):len(s)], since len(s) < 999 and your step is positive (1, the default).
Slicing is not bounds-checked by the built-in types. And although both of your examples appear to have the same result, they work differently; try them with a list instead.
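The suggestion to try a list instead can be sketched as follows. This is only an illustration in Python 3, with made-up variable names, comparing an indexed item against a one-item slice for both types.

```python
word = 'example'
numbers = [0, 1, 2, 3, 4, 5]

# For a string, an item and a one-item slice are both 1-character strings.
same_for_str = word[3] == word[3:4]

# For a list, indexing returns the element while slicing returns a list.
same_for_list = numbers[3] == numbers[3:4]

# A slice that lies entirely out of range is simply empty.
empty = word[999:9999]

print(same_for_str, same_for_list, repr(empty))
```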

What are the default slice indices *really*?

From the python documentation docs.python.org/tutorial/introduction.html#strings:
Slice indices have useful defaults; an omitted first index defaults to zero, an omitted second index defaults to the size of the string being sliced.
For the standard case, this makes a lot of sense:
>>> s = 'mystring'
>>> s[1:]
'ystring'
>>> s[:3]
'mys'
>>> s[:-2]
'mystri'
>>> s[-1:]
'g'
>>>
So far, so good. However, using a negative step value seems to suggest slightly different defaults:
>>> s[:3:-1]
'gnir'
>>> s[0:3:-1]
''
>>> s[2::-1]
'sym'
Fine, perhaps if the step is negative, the defaults reverse. An omitted first index defaults to the size of the string being sliced, an omitted second index defaults to zero:
>>> s[len(s):3:-1]
'gnir'
Looking good!
>>> s[2:0:-1]
'sy'
Whoops. Missed that 'm'.
Then there is everyone's favorite string reverse statement. And sweet it is:
>>> s[::-1]
'gnirtsym'
However:
>>> s[len(s):0:-1]
'gnirtsy'
The slice never includes the value of the second index in the slice. I can see the consistency of doing it that way.
So I think I am beginning to understand the behavior of slice in its various permutations. However, I get the feeling that the second index is somewhat special, and that the default value of the second index for a negative step can not actually be defined in terms of a number.
Can anyone concisely define the default slice indices that can account for the provided examples? Documentation would be a huge plus.
There actually aren't any defaults; omitted values are treated specially.
However, in every case, omitted values happen to be treated in exactly the same way as None. This means that, unless you're hacking the interpreter (or using the parser, ast, etc. modules), you can just pretend that the defaults are None (as recursive's answer says), and you'll always get the right answers.
The informal documentation cited isn't quite accurate—which is reasonable for something that's meant to be part of a tutorial. For the real answers, you have to turn to the reference documentation.
For 2.7.3, Sequence Types describes slicing in notes 3, 4, and 5.
For [i:j]:
… If i is omitted or None, use 0. If j is omitted or None, use len(s).
And for [i:j:k]:
If i or j are omitted or None, they become “end” values (which end depends on the sign of k). Note, k cannot be zero. If k is None, it is treated like 1.
For 3.3, Sequence Types has the exact same wording as 2.7.3.
The end value is always exclusive, thus the 0 end value means include index 1 but not 0. Use None instead (since negative numbers have a different meaning):
>>> s[len(s)-1:None:-1]
'gnirtsym'
Note the start value as well; the last character index is at len(s) - 1; you may as well spell that as -1 (as negative numbers are interpreted relative to the length):
>>> s[-1:None:-1]
'gnirtsym'
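As a rough check of the points above, the following Python 3 sketch compares omitted bounds with explicit None and uses slice.indices to show what the omitted bounds resolve to for this string. The variable names are illustrative.

```python
s = 'mystring'

# Omitted bounds and explicit None behave identically.
reversed_omitted = s[::-1]
reversed_none = s[None:None:-1]

# What the omitted bounds resolve to for len(s) == 8 with step -1.
resolved = slice(None, None, -1).indices(len(s))

# The resolved stop of -1 means "one before index 0" internally. Written
# literally in a slice, -1 is taken relative to the end instead, so the
# slice below starts and stops at index 7 and is empty.
literal = s[7:-1:-1]

# A stop of -len(s) - 1 is still below index 0 after the length is added,
# so it is clamped to the same "before the first element" position.
explicit = s[7:-len(s) - 1:-1]

print(resolved, repr(literal), repr(explicit))
```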
I don't have any documentation, but I think the default is [None:None:None]
>>> "asdf"[None:None:None]
'asdf'
>>> "asdf"[None:None:-1]
'fdsa'
The notes in the reference documentation for sequence types explain this in some detail:
(5.) The slice of s from i to j with step k is defined as the sequence of items with index x = i + n*k such that 0 <= n < (j-i)/k. In other words, the indices are i, i+k, i+2*k, i+3*k and so on, stopping when j is reached (but never including j). If i or j is greater than len(s), use len(s). If i or j are omitted or None, they become “end” values (which end depends on the sign of k). Note, k cannot be zero. If k is None, it is treated like 1.
So you can get the following behaviour:
>>> s = "mystring"
>>> s[2:None:-1]
'sym'
Actually it is logical...
The end value always points to the index just past the last one included.
So with a negative step, using 0 as the end value means the slice stops at the element at index 1. You need to omit the end value to get the string you want.
>>> s = '0123456789'
>>> s[0], s[:0]
('0', '')
>>> s[1], s[:1]
('1', '0')
>>> s[2], s[:2]
('2', '01')
>>> s[3], s[:3]
('3', '012')
>>> s[0], s[:0:-1]
('0', '987654321')
Useful to know if you are implementing __getslice__: j defaults to sys.maxsize (https://docs.python.org/2/reference/datamodel.html#object.getslice)
>>> class x(str):
...     def __getslice__(self, i, j):
...         print i
...         print j
...
...     def __getitem__(self, key):
...         print repr(key)
...
>>> x()[:]
0
9223372036854775807
>>> x()[::]
slice(None, None, None)
>>> x()[::1]
slice(None, None, 1)
>>> x()[:1:]
slice(None, 1, None)
>>> import sys
>>> sys.maxsize
9223372036854775807L
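The session above is Python 2. In Python 3, __getslice__ no longer exists, so every subscript goes through __getitem__. The sketch below uses a hypothetical Recorder class (not part of the original answer) that simply returns the key it receives, which shows the omitted parts arriving as None.

```python
class Recorder:
    """Hypothetical helper that returns the key passed to __getitem__."""

    def __getitem__(self, key):
        return key


r = Recorder()
plain = r[:]       # both bounds omitted
stepped = r[::1]   # only the step given
partial = r[:1:]   # only the stop given
single = r[3]      # plain indexing passes the integer itself

print(plain, stepped, partial, single)
```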
There are excellent answers here, and the best one has been accepted, but if you are looking for a way to wrap your head around the default values for slices, it helps to imagine the list as having two ends: the HEAD end, just before the first element, and the TAIL end, just after the last element.
Now answering the actual question:
There are two sets of defaults for slices.
Defaults when the step is positive:
0:TAIL:+ve step
Defaults when the step is negative:
-1:HEAD:-ve step
Great question. I thought I knew how slicing worked until I read this post. While your question title asks about "default slice indices", and that has been answered by abarnet, Martijn, and others, the body of your post suggests your real question is "How does slicing work?". So I'll take a stab at that.
Explanation
Given your example, s = "mystring", you can imagine a set of positive and negative indices.
 m  y  s  t  r  i  n  g
 0  1  2  3  4  5  6  7   <- positive indices
-8 -7 -6 -5 -4 -3 -2 -1   <- negative indices
We select slices of the form s[i:j:k]. The logic changes depending on whether k is positive or negative. I would describe the algorithm as follows.
if k is empty, set k = 1
if k is positive:
    move right, from i (inclusive) to j (exclusive) stepping by abs(k)
    if i is empty, start from the left edge
    if j is empty, go until the right edge
if k is negative:
    move left, from i (inclusive) to j (exclusive) stepping by abs(k)
    if i is empty, start from the right edge
    if j is empty, go until the left edge
(Note this isn't exactly pseudocode, as I intended it to be more comprehensible.)
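The steps above can be turned into a small runnable sketch. The function name slice_by_walking and its helper resolve are hypothetical, the code targets Python 3 and strings only, and the handling of negative and out-of-range bounds (which the outline leaves implicit) is filled in on the assumption that it should match the built-in behavior.

```python
def slice_by_walking(s, i=None, j=None, k=None):
    """Hypothetical helper: build s[i:j:k] for a string by walking indices."""
    n = len(s)
    if k is None:
        k = 1
    if k == 0:
        raise ValueError("slice step cannot be zero")

    def resolve(bound, lowest, highest):
        # Negative bounds count from the right edge; then clamp to the edges.
        if bound < 0:
            bound += n
        return max(lowest, min(bound, highest))

    if k > 0:
        # Moving right: the edges are index 0 and one past the last index.
        start = 0 if i is None else resolve(i, 0, n)
        stop = n if j is None else resolve(j, 0, n)
    else:
        # Moving left: the edges are the last index and one before index 0.
        start = n - 1 if i is None else resolve(i, -1, n - 1)
        stop = -1 if j is None else resolve(j, -1, n - 1)

    picked = []
    x = start
    while (k > 0 and x < stop) or (k < 0 and x > stop):
        picked.append(s[x])
        x += k
    return ''.join(picked)


print(slice_by_walking('mystring', None, 3, -1))
```

Comparing the helper against the built-in slice over a range of bounds and steps is a quick way to convince yourself that the outline covers every case.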
Examples
>>> s[:3:]
'mys'
Here, k is empty so we set it equal to 1. Then since k is positive, we move right from i to j. Since i is empty, we start from the left edge and select everything up to but excluding the element at index 3.
>>> s[:3:-1]
'gnir'
Here, k is negative, so we move left from i to j. Since i is empty, we start from the right edge and select everything up to but excluding the element at index 3.
>>> s[0:3:-1]
''
Here, k is negative, so we move left from i to j. Since index 3 isn't to the left of index 0, no elements are selected and we get back the empty string.

Python Syntax / List Slicing Question: What does this syntax mean?

lines = file('info.csv','r').readlines()
counts = []
for i in xrange(4):
    counts.append(fromstring(lines[i][:-2],sep=',')[0:-1])
If anyone can explain this code to me, it would be greatly appreciated. I can't seem to find more advanced examples on slicing--only very simple ones that don't explain this situation.
Thank you very much.
A slice takes the form o[start:stop:step], where all three parts are optional. For a positive step, start defaults to 0, the first index. stop defaults to len(o), the exclusive upper bound on the indices of the list. step defaults to 1, which includes every value of the list.
If you specify a negative value for start or stop, it represents an offset from the end of the list. For example, [-1] accesses the last element in a list, and [-2] the second to last.
If you enter a non-1 value for step, you will include different elements or include them in a different order. 2 would skip every other element. 3 would skip two out of every three. -1 would go backwards through the list.
[:-2]
Since start is omitted, it defaults to the beginning of the list. A stop of -2 indicates to exclude the last two elements. So o[:-2] slices the list to exclude the last two elements.
[0:-1]
The 0 here is redundant, because it's what start would have defaulted to anyway. This is the same as the other slice, except that it only excludes the last element.
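Here is a small sketch of what the two slices do to one line of input. The sample line and its '\r\n' ending are assumptions, since the contents of info.csv are not shown, and a plain list comprehension stands in for the fromstring call so that the example runs in Python 3 without extra libraries.

```python
# Assumed sample line; the real contents of info.csv are not shown.
line = '1.5,2.5,3.5,4.5\r\n'

# [:-2] drops the last two characters (here, the assumed line ending).
trimmed = line[:-2]

# Stand-in for fromstring(trimmed, sep=','): parse the comma-separated numbers.
values = [float(x) for x in trimmed.split(',')]

# [0:-1] drops the last parsed value; the explicit 0 is redundant.
kept = values[0:-1]

print(repr(trimmed), values, kept)
```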
From the Data model page of the Python 2.7 docs:
Sequences also support slicing: a[i:j] selects all items with index k such that i <= k < j. When used as an expression, a slice is a sequence of the same type. This implies that the index set is renumbered so that it starts at 0.
Some sequences also support “extended slicing” with a third “step” parameter: a[i:j:k] selects all items of a with index x where x = i + n*k, n >= 0 and i <= x < j.
The "what's new" section of the Python 2.3 documentation discusses them as well, when they were added to the language.
A good way to understand the slice syntax is to think of it as syntactic sugar for the equivalent for loop. For example:
L[a:b:c]
Is equivalent to (e.g., in C):
for (int i = a; i < b; i += c) {
    // slice contains L[i]
}
Where a defaults to 0, b defaults to len(L), and c defaults to 1.
(And if c, the step, is a negative number, then the default values of a and b are reversed. This gives a sensible result for L[::-1]).
Then the only other thing you need to know is that, in Python, indexes "wrap around", so that L[-1] signifies the last item in the list, L[-2] is the second to last, and so forth.
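The same loop can be sketched in Python itself. The helper name slice_with_loop is hypothetical; it relies on the built-in slice.indices method to fill in the defaults and resolve negative indices, then walks the resulting range, which also covers the negative-step case. The code assumes Python 3.

```python
def slice_with_loop(seq, a=None, b=None, c=None):
    """Hypothetical helper: spell seq[a:b:c] as an explicit index loop."""
    # indices() fills in defaults and resolves negative or oversized bounds.
    start, stop, step = slice(a, b, c).indices(len(seq))
    result = []
    for i in range(start, stop, step):
        result.append(seq[i])
    return result


L = list('mystring')
print(slice_with_loop(L, None, None, -1))
```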
If list is a list then list[-1] is the last element of the list, list[-2] is the element before it and so on.
Also, list[a:b] means the list of all elements of list at positions from a up to, but not including, b. If one of them is missing, it is assumed to mean the corresponding end of the list. Thus, list[2:] is the list of all elements starting from list[2], and list[:-2] is the list of all elements from list[0] up to, but not including, list[-2].
In your code, the [0:-1] part is the same as [:-1].
