Coming from R, I am trying to get my head around integer slicing for pandas DataFrames.
What puzzles me is the different slicing behavior for rows and columns when using the same integer/slice expression.
import pandas as pd
x = pd.DataFrame({'a': range(0, 6),
                  'b': range(7, 13),
                  'c': range(14, 20)})
x.ix[0:2, 0:2]  # Why 3 x 2 and not 3 x 3 or 2 x 2?
   a  b
0  0  7
1  1  8
2  2  9
We get 3 rows but only 2 columns. In the docs I find that, different from standard Python, label-based slicing in pandas is inclusive. Does this apply here, and is it then inclusive for rows but not for columns?
Can someone explain the behavior and the rationale behind it?
You are correct that there is a distinction between label based indexing and position based indexing. The first includes the end label, while typical python position based slicing does not include the last item.
In the example you give: x.ix[0:2, 0:2] the rows are being sliced based on the labels, so '2' is included (returning 3 rows), while the columns are sliced based on position, hence returning only 2 columns.
If you want guaranteed position based slicing (to return a 2x2 frame in this case), iloc is the indexer to use:
In [6]: x.iloc[0:2, 0:2]
Out[6]:
   a  b
0  0  7
1  1  8
For guaranteed label based slicing, you can use the loc indexer.
The ix indexer you are using is more flexible (not strict about the type of indexing). It is primarily label based, but will fall back to position based indexing (when the labels are not found and you are using integers). This is what happens in your example for the columns. For this reason, it is recommended to always use loc/iloc instead of ix (unless you need mixed label/position based indexing); note that ix has since been deprecated and was removed in pandas 1.0.
See the docs for a more detailed overview of the different types of indexers: http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing
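To make the contrast concrete, here is a minimal sketch using the x frame from the question ('a' and 'b' are its actual column labels):
x.loc[0:2, 'a':'b']   # label based: both ends inclusive -> 3 rows x 2 columns
x.iloc[0:2, 0:2]      # position based: stop excluded -> 2 rows x 2 columns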
The ix method is primarily label based with a fallback to positional indexing. From the docs:
A primarily label-location based indexer, with integer position
fallback.
.ix[] supports mixed integer and label based access. It is
primarily label based, but will fall back to integer positional
access unless the corresponding axis is of integer type.
.ix is the most general indexer and will support any of the
inputs in .loc and .iloc. .ix also supports floating
point label schemes. .ix is exceptionally useful when dealing
with mixed positional and label based hierarchical indexes.
However, when an axis is integer based, ONLY label based access
and not positional access is supported. Thus, in such cases, it's
usually better to be explicit and use .iloc or .loc.
So the rationale is that it is trying to help you. As with most things where software assumes your intent, this can have unexpected consequences. Where it does find the labels in the named range, it performs a selection that is inclusive at both ends, as this is what you would normally want when analyzing data.
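Here is a small sketch of that fallback rule (illustrative only: .ix was deprecated in pandas 0.20 and removed in 1.0, so this runs only on old versions):
import pandas as pd
s = pd.DataFrame({'v': [10, 20, 30]}, index=['p', 'q', 'r'])
s.ix[0:2]       # integer slice on a string index: falls back to positions -> rows 'p', 'q'
s.ix['p':'q']   # label slice: inclusive at both ends -> rows 'p', 'q'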
Related
I am curious as to why df[2] is not supported, while df.ix[2] and df[2:3] both work.
In [26]: df.ix[2]
Out[26]:
A    1.027680
B    1.514210
C   -1.466963
D   -0.162339
Name: 2000-01-03 00:00:00
In [27]: df[2:3]
Out[27]:
                  A        B         C         D
2000-01-03  1.02768  1.51421 -1.466963 -0.162339
I would expect df[2] to work the same way as df[2:3] to be consistent with Python indexing convention. Is there a design reason for not supporting indexing row by single integer?
Echoing @HYRY, see the new docs in 0.11:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
Here we have new operators, .iloc to explicitly support only integer indexing, and .loc to explicitly support only label indexing.
e.g. imagine this scenario
In [1]: df = pd.DataFrame(np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))
In [2]: df
Out[2]:
          A         B
0  1.068932 -0.794307
2 -0.470056  1.192211
4 -0.284561  0.756029
6  1.037563 -0.267820
8 -0.538478 -0.800654
In [5]: df.iloc[[2]]
Out[5]:
          A         B
4 -0.284561  0.756029
In [6]: df.loc[[2]]
Out[6]:
          A         B
2 -0.470056  1.192211
[] slices the rows (by integer location or label) only
The primary purpose of the DataFrame indexing operator, [] is to select columns.
When the indexing operator is passed a string or integer, it attempts to find a column with that particular name and return it as a Series.
So, in the question above: df[2] searches for a column name matching the integer value 2. This column does not exist and a KeyError is raised.
The DataFrame indexing operator completely changes behavior to select rows when slice notation is used
Strangely, when given a slice, the DataFrame indexing operator selects rows and can do so by integer location or by index label.
df[2:3]
This will slice beginning from the row with integer location 2 up to 3, exclusive of the last element. So, just a single row. The following selects rows beginning at integer location 6 up to but not including 20 by every third row.
df[6:20:3]
You can also use slices consisting of string labels if your DataFrame index has strings in it. For more details, see this solution on .iloc vs .loc.
I almost never use this slice notation with the indexing operator, as it's not explicit and hardly ever used. When slicing by rows, stick with .loc/.iloc.
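To see both slice flavors on a labeled index, here is a minimal sketch (frame is a hypothetical DataFrame, not from the question):
import pandas as pd
frame = pd.DataFrame({'A': range(4)}, index=['w', 'x', 'y', 'z'])
frame['x':'y']   # label slice over rows: inclusive of both ends -> 'x' and 'y'
frame[1:3]       # integer slice over rows: end exclusive -> 'x' and 'y' as well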
You can think of a DataFrame as a dict of Series. df[key] tries to select the column indexed by key and returns a Series object.
However, slicing inside of [] slices the rows, because it's a very common operation.
You can read the document for detail:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
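A minimal sketch of that dict-of-Series view (the names here are illustrative):
import pandas as pd
data = {'a': pd.Series(range(3)), 'b': pd.Series(range(3, 6))}
df = pd.DataFrame(data)
df['a']    # column lookup by key, just like data['a'] -> a Series
df[0:2]    # a slice selects rows instead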
For index-based access to a pandas table, one can also convert it to a NumPy array:
np_df = df.as_matrix()
and then
np_df[i]
would work. (Note that as_matrix() has since been deprecated and removed.)
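A short sketch with the modern replacement API:
np_df = df.to_numpy()   # replaces the deprecated as_matrix()
np_df[2]                # row at integer position 2, as a NumPy array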
You can take a look at the source code.
DataFrame has a private function _slice() to slice the DataFrame, and it accepts an axis parameter that determines which axis to slice. __getitem__() for DataFrame doesn't set the axis when invoking _slice(), so _slice() slices along the default axis 0.
You can try a simple experiment, which might help (keep in mind that _slice() is internal API, so its signature may change between versions):
print(df._slice(slice(0, 2)))      # default axis 0: first two rows
print(df._slice(slice(0, 2), 0))   # explicit axis 0: same as above
print(df._slice(slice(0, 2), 1))   # axis 1: first two columns
You can loop through the DataFrame like this:
for ad in range(len(dataframe_c)):   # len() gives the row count; .size would be rows * columns
    print(dataframe_c.values[ad])
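A more idiomatic alternative for row iteration (assuming the same dataframe_c):
for row in dataframe_c.itertuples(index=False):
    print(row)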
I would normally go for .loc/.iloc as suggested by Ted, but one may also select a row by transposing the DataFrame. To stay in the example above, df.T[2] gives you row 2 of df.
If you want to index multiple rows by their integer indexes, use a list of indexes:
idx = [2,3,1]
df.iloc[idx]
N.B. If idx is created using some rule, then you can also sort the dataframe by using .iloc (or .loc) because the output will be ordered by idx. So in a sense, iloc can act like a sorting function where idx is the sorting key.
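A quick sketch of that ordering behavior (hypothetical data):
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30, 40]})
idx = [2, 3, 1]
df.iloc[idx]   # rows come back in exactly the order 2, 3, 1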
I have a dataframe with 11 columns and ~19k rows. The last column is either a 1 or 0, but when I use .describe() all I get is
count     19020
unique        2
top           1
freq      12332
Name: Class, dtype: int64
as opposed to an actual statistical analysis with mean, std, etc.
Is there a way to do this?
If your numeric (0, 1) column is not being picked up automatically by .describe(), it might be because it's not actually encoded as an int dtype. You can see this in the documentation of the .describe() method, which tells you that the default include parameter is only for numeric types:
None (default) : The result will include all numeric columns.
My suggestion would be the following:
df.dtypes # check datatypes
df['num'] = df['num'].astype(int) # if it's not integer, cast it as such
df.describe(include=['object', 'int64']) # explicitly state the data types you'd like to describe
That is, first check the datatypes (I'm assuming the column is called num and the dataframe df, but feel free to substitute with the right ones). If this indicator/(0,1) column is indeed not encoded as int/integer type, then cast it as such by using .astype(int). Then, you can freely use df.describe() and perhaps even specify columns of which data types you want to include in the description output, for more fine-grained control.
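A compact illustration of why the dtype matters here (a hypothetical Series, not the asker's data):
import pandas as pd
s = pd.Series(['1', '0', '1'])
s.describe()                # object dtype: count/unique/top/freq only
s.astype(int).describe()    # int dtype: count, mean, std, min, quartiles, max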
You could use
# list of percentiles to include
perc = [.20, .40, .60, .80]
# list of dtypes to include
include = ['object', 'float', 'int']
data.describe(percentiles=perc, include=include)
where data is your dataframe (important point).
Since you are new to Stack Overflow, I might suggest that you include some actual code (i.e. something showing how and on what you are using your methods). You'll get better answers.
For some reason, the following 2 calls to iloc / loc produce different behavior:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(A=range(3), B=range(3)))
>>> df.iloc[:1]
   A  B
0  0  0
>>> df.loc[:1]
   A  B
0  0  0
1  1  1
I understand that loc considers the row labels, while iloc considers the integer-based indices of the rows. But why is the upper bound for the loc call considered inclusive, while the iloc bound is considered exclusive?
Quick answer:
It often makes more sense to do end-inclusive slicing when using labels, because it requires less knowledge about other rows in the DataFrame.
Whenever you care about labels instead of positions, end-exclusive label slicing introduces position-dependence in a way that can be inconvenient.
Longer answer:
Any function's behavior is a trade-off: you favor some use cases over others. Ultimately the operation of .loc is a subjective design decision by the pandas developers (and as the comment by @ALlollz indicates, this behavior is intentional). But to understand why they might have designed it that way, think about what makes label slicing different from positional slicing.
Imagine we have two DataFrames df1 and df2:
df1 = pd.DataFrame(dict(X=range(4)), index=['a','b','c','d'])
df2 = pd.DataFrame(dict(X=range(3)), index=['b','c','z'])
df1 contains:
   X
a  0
b  1
c  2
d  3
df2 contains:
   X
b  0
c  1
z  2
Let's say we have a label-based task to perform: we want to get rows between b and c from both df1 and df2, and we want to do it using the same code for both DataFrames. Because b and c don't have the same positions in both DataFrames, simple positional slicing won't do the trick. So we turn to label-based slicing.
If .loc were end-exclusive, to get rows between b and c we would need to know not only the label of our desired end row, but also the label of the next row after that. As constructed, this next label would be different in each DataFrame.
In this case, we would have two options:
Use separate code for each DataFrame: df1.loc['b':'d'] and df2.loc['b':'z']. This is inconvenient because it means we need to know extra information beyond just the rows that we want.
For either dataframe, get the positional index first, add 1, and then use positional slicing: df.iloc[df.index.get_loc('b'):df.index.get_loc('c')+1]. This is just wordy.
But since .loc is end-inclusive, we can just say .loc['b':'c']. Much simpler!
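Concretely, with the frames defined above, the identical call works for both:
df1.loc['b':'c']   # rows b and c of df1
df2.loc['b':'c']   # rows b and c of df2, even though their positions differ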
Whenever you care about labels instead of positions and you're trying to write position-independent code, end-exclusive label slicing re-introduces position-dependence in a way that can be inconvenient.
That said, maybe there are use cases where you really do want end-exclusive label-based slicing. If so, you can use @Willz's answer in this question:
df.loc[start:end].iloc[:-1]
It might be a stupid question but it is driving me crazy. I have a corpus composed of 8807 articles:
print(type(doc_set))
<class 'pandas.core.series.Series'>
print(len(doc_set))
8807
From this list, I just want to select the first one. I have tried doc_set[1], but it retrieves 46 articles. Any idea how to select a specific article? Thanks
Try using the iloc indexer:
doc_set.iloc[0]
Docs [iloc]:
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the
axis), but may also be used with a boolean array.
Allowed inputs are:
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A callable function with one argument (the calling Series, DataFrame or Panel) that returns valid output for indexing (one of the above).
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
Or the iat indexer:
doc_set.iat[0]
Docs [iat]:
Fast integer location scalar accessor.
Similarly to iloc, iat provides integer based lookups. You can also
set using these indexers.
P.S. iat should be faster than iloc, because the latter incurs some overhead.
I think you have duplicates in your index.
Use iat if you need to select the first value of the Series:
doc_set = pd.Series([8, 9, 10], index=[1, 1, 1])
print(doc_set)
1     8
1     9
1    10
dtype: int64

print(doc_set[1])
1     8
1     9
1    10
dtype: int64

print(doc_set.iat[0])
8
I have two 2D numpy arrays shaped:
(19133L, 12L)
(248L, 6L)
In each case, the first 3 fields form an identifier.
I want to reduce the larger matrix so that it only contains rows with identifiers that also exist in the second matrix. So the shape should be (248L, 12L). How can I do this?
I would then like to sort it so that the rows are ordered by the first value, then the second, then the third, so that (3 3 4) comes before (3 3 5) etc. Is there a multi-field sort function?
Edit:
I have tried pandas:
from pandas import DataFrame, merge
df1 = DataFrame(arr1.astype(str))
df2 = DataFrame(arr2.astype(str))
df1.set_index([0, 1, 2])
df2.set_index([0, 1, 2])
out = merge(df1, df2, how="inner")
print(out.shape)
But this results in a (0, 13) shape
Use pandas.
DataFrame.set_index() allows multiple keys. So set the index to the first three columns (use drop=False so the identifier columns are kept as data, and inplace=True to avoid an unnecessary copy; note that your attempt above discards the result, since without inplace=True set_index returns a new frame).
Then merge(..., how='inner') to intersect your dataframes.
In general, numpy runs out of steam very quickly for arbitrary dataframe manipulations, so your default should be to try pandas, which is also much more performant for this kind of work.
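For completeness, a minimal sketch of that recipe (the arrays are hypothetical stand-ins for the question's data; it merges on the identifier columns directly, an alternative to the set_index route):
import numpy as np
import pandas as pd

arr1 = np.random.randint(0, 5, size=(100, 12))   # stand-in for the (19133, 12) array
arr2 = np.random.randint(0, 5, size=(10, 6))     # stand-in for the (248, 6) array

df1 = pd.DataFrame(arr1)
df2 = pd.DataFrame(arr2)

# keep only rows of df1 whose first three columns also appear in df2
ids = df2[[0, 1, 2]].drop_duplicates()
out = pd.merge(df1, ids, on=[0, 1, 2], how='inner')

# multi-field sort on the three identifier columns
out = out.sort_values(by=[0, 1, 2])
print(out.shape)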