pandas: slicing along first level of multiindex - python

I've set up a DataFrame with two indices. But slicing doesn't behave as expected.
I realize that this is a very basic problem, so I searched for similar questions:
pandas: slice a MultiIndex by range of secondary index
Python Pandas slice multiindex by second level index (or any other level)
I also looked at the corresponding documentation
Strangely none of the proposed solutions work for me.
I've set up a simple example to showcase the problem:
# this is my DataFrame
frame = pd.DataFrame([
{"a":1, "b":1, "c":"11"},
{"a":1, "b":2, "c":"12"},
{"a":2, "b":1, "c":"21"},
{"a":2, "b":2, "c":"22"},
{"a":3, "b":1, "c":"31"},
{"a":3, "b":2, "c":"32"}])
# now set a and b as multiindex
frame = frame.set_index(["a","b"])
Now I'm trying different ways of slicing the frame.
The first two lines work, the third throws an exception:
# selecting a specific cell works
frame.loc[1,2]
# slicing along the second index works
frame.loc[1,:]
# slicing along the first doesn't work
frame.loc[:,1]
It's a TypeError:
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>
Solution 1: Using tuples of slices
This is proposed in this question: pandas: slice a MultiIndex by range of secondary index
Indeed, you can pass a slice for each level
But that doesn't work for me, the same type error as above is produced.
frame.loc[(slice(1,2), 1)]
Solution 2: Using IndexSlice
Python Pandas slice multiindex by second level index (or any other level)
Use an indexer to slice arbitrary values in arbitrary dimensions
Again, that doesn't work for me, it produces the same type error.
frame.loc[pd.IndexSlice[:,2]]
I don't understand how this typeerror can be produced. After all I can use integers to select specific cells, and ranges along the second dimension work fine.
Googling for my specific error message doesn't really help.
For example, here someone tries to use integers to slice along an index of type float: https://github.com/pandas-dev/pandas/issues/12333
I tried explicitly converting my indices to int, maybe the numpy backend stores everything as float by default ?
But that didn't change anything, afterwards the same errors as above appear:
frame["a"]=frame["a"].apply(lambda x : int(x))
frame["b"]=frame["b"].apply(lambda x : int(x))
type(frame["b"][0]) # it's numpy.int64

IIUC you just have to specify : for columns when indexing a multi-index DF:
In [40]: frame.loc[pd.IndexSlice[:,2], :]
Out[40]:
c
a b
1 2 12
2 2 22
3 2 32

Related

Does iloc use the indices or the place of the row

I have extracted few rows from a dataframe to a new dataframe. In this new dataframe old indices remain. However, when i want to specify range from this new dataframe i used it like new indices, starting from zero. Why did it work? Whenever I try to use the old indices it gives an error.
germany_cases = virus_df_2[virus_df_2['location'] == 'Germany']
germany_cases = germany_cases.iloc[:190]
This is the code. The rows that I extracted from the dataframe virus_df_2 have indices between 16100 and 16590. I wanted to take the first 190 rows. in the second line of code i used iloc[:190] and it worked. However, when i tried to use iloc[16100:16290] it gave an error. What could be the reason?
In pandas there are two attributes, loc and iloc.
The iloc is, as you have noticed, an indexing based on the order of the rows in memory, so there you can reference the nth line using iloc[n].
In order to reference rows using the pandas indexing, which can be manually altered and can not only be integers but also strings or other objects that are hashable (have the __hash__ method defined), you should use loc attribute.
In your case, iloc raises an error because you are trying to access a range that is outside the region defined by your dataframe. You can try loc instead and it will be ok.
At first it will be hard to grasp the indexing notation, but it can be very helpful in some circumstances, like for example sorting or performing grouping operations.
Quick example that might help:
df = pd.DataFrame(
dict(
France=[1, 2, 3],
Germany=[4, 5, 6],
UK=['x', 'y', 'z'],
))
df = df.loc[:,"Germany"].iloc[1:2]
Out:
1 5
Name: Germany, dtype: int64
Hope I could help.

What advantages does the iloc function have in pandas and Python

I just began to learn Python and Pandas and I saw in many tutorials the use of the iloc function. It is always stated that you can use this function to refer to columns and rows in a dataframe. However, you can also do this directly without the iloc function. So here is an example that yield the same output:
# features is just a dataframe with several rows and columns
features = pd.DataFrame(features_standardized)
y_train = features.iloc[start:end] [[1]]
y_train_noIloc = features [start:end] [[1]]
What is the difference between the two statements and what advantage do I have when using iloc? I'd appreicate every comment.
Per the pandas docs, iloc provides:
Purely integer-location based indexing for selection by position.
Therefore, as shown in the simplistic examples below, [row, col] indexing is not possible without using loc or iloc, as a KeyError will be thrown.
Example:
# Build a simple, sample DataFrame.
df = pd.DataFrame({'a': [1, 2, 3, 4]})
# No iloc
>>> df[0, 0]
KeyError: (0, 0)
# With iloc:
>>> df.iloc[0, 0]
1
The same logic holds true when using loc and a column name.
What is the difference and when does the indexing work without iloc?
The short answer:
Use loc and/or iloc when indexing rows and columns. If indexing on row or column, you can get away without it, and is referred to as 'slicing'.
However, I see in your example [start:end][[1]] has been used. It is generaly considered bad practice to have back-to-back square brackets in pandas, (e.g.: [][]), and generally an indication that a different (more efficient) approach should be taken - in this case, using iloc.
The longer answer:
Adapting your [start:end] slicing example (shown below), indexing works without iloc when indexing (slicing) on row only. The following example does not use iloc and will return rows 0 through 3.
df[0:3]
Output:
a
0 1
1 2
2 3
Note the difference in [0:3] and [0, 3]. The former (slicing) uses a colon and will return rows or indexes 0 through 3. Whereas the latter uses a comma, and is a [row, col] indexer, which requires the use of iloc.
Aside:
The two methods can be combined as show here, and will return rows 0 through 3, for column index 0. Whereas this is not possible without the use of iloc.
df.iloc[0:3, 0]

Why is it ok to have an index with lists as values but not ok for columns?

Consider the numpy.array i
i = np.empty((1,), dtype=object)
i[0] = [1, 2]
i
array([list([1, 2])], dtype=object)
Example 1
index
df = pd.DataFrame([1], index=i)
df
0
[1, 2] 1
Example 2
columns
But
df = pd.DataFrame([1], columns=i)
Leads to this when I display it
df
TypeError: unhashable type: 'list'
However, df.T works!?
Question
Why is it necessary for index values to be hashable in a column context but not in an index context? And why only when it's displayed?
This is because of how pandas internally determines the string representation of the DataFrame object. Essentially, the difference between column labels and index labels here is that the column determines the format of the string representation (as the column could be a float, int, etc.).
The error thus happens because pandas stores a separate formatter object for each column in a dictionary and this object is retrieved using the column name. Specifically, the line that triggers the error is https://github.com/pandas-dev/pandas/blob/d1accd032b648c9affd6dce1f81feb9c99422483/pandas/io/formats/format.py#L420
The "unhashable type" error usually means that the type, list in this case, is mutable. Mutable types aren't hashable, because they may change after they have produced the hash code. This happens because you are trying to retrieve an item using a list as a key, but since a key has to be hashable, the retrieval fails.

Why is 'series' type not a special case of 'dataframe' type in pandas? [duplicate]

In Pandas, when I select a label that only has one entry in the index I get back a Series, but when I select an entry that has more then one entry I get back a data frame.
Why is that? Is there a way to ensure I always get back a data frame?
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame
In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series
Granted that the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.
In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame
In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame
The TLDR
When using loc
df.loc[:] = Dataframe
df.loc[int] = Dataframe if you have more than one column and Series if you have only 1 column in the dataframe
df.loc[:, ["col_name"]] = Dataframe if you have more than one row and Series if you have only 1 row in the selection
df.loc[:, "col_name"] = Series
Not using loc
df["col_name"] = Series
df[["col_name"]] = Dataframe
You have an index with three index items 3. For this reason df.loc[3] will return a dataframe.
The reason is that you don't specify the column. So df.loc[3] selects three items of all columns (which is column 0), while df.loc[3,0] will return a Series. E.g. df.loc[1:2] also returns a dataframe, because you slice the rows.
Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.
If you want to be sure to always have a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but this used location not labels!).
Use df['columnName'] to get a Series and df[['columnName']] to get a Dataframe.
You wrote in a comment to joris' answer:
"I don't understand the design
decision for single rows to get converted into a series - why not a
data frame with one row?"
A single row isn't converted in a Series.
It IS a Series: No, I don't think so, in fact; see the edit
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data. For example, DataFrame is a
container for Series, and Panel is a container for DataFrame objects.
We would like to be able to insert and remove objects from these
containers in a dictionary-like fashion.
http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure
The data model of Pandas objects has been choosen like that. The reason certainly lies in the fact that it ensures some advantages I don't know (I don't fully understand the last sentence of the citation, maybe it's the reason)
.
Edit : I don't agree with me
A DataFrame can't be composed of elements that would be Series, because the following code gives the same type "Series" as well for a row as for a column:
import pandas as pd
df = pd.DataFrame(data=[11,12,13], index=[2, 3, 3])
print '-------- df -------------'
print df
print '\n------- df.loc[2] --------'
print df.loc[2]
print 'type(df.loc[1]) : ',type(df.loc[2])
print '\n--------- df[0] ----------'
print df[0]
print 'type(df[0]) : ',type(df[0])
result
-------- df -------------
0
2 11
3 12
3 13
------- df.loc[2] --------
0 11
Name: 2, dtype: int64
type(df.loc[1]) : <class 'pandas.core.series.Series'>
--------- df[0] ----------
2 11
3 12
3 13
Name: 0, dtype: int64
type(df[0]) : <class 'pandas.core.series.Series'>
So, there is no sense to pretend that a DataFrame is composed of Series because what would these said Series be supposed to be : columns or rows ? Stupid question and vision.
.
Then what is a DataFrame ?
In the previous version of this answer, I asked this question, trying to find the answer to the Why is that? part of the question of the OP and the similar interrogation single rows to get converted into a series - why not a data frame with one row? in one of his comment,
while the Is there a way to ensure I always get back a data frame? part has been answered by Dan Allan.
Then, as the Pandas' docs cited above says that the pandas' data structures are best seen as containers of lower dimensional data, it seemed to me that the understanding of the why would be found in the characteristcs of the nature of DataFrame structures.
However, I realized that this cited advice must not be taken as a precise description of the nature of Pandas' data structures.
This advice doesn't mean that a DataFrame is a container of Series.
It expresses that the mental representation of a DataFrame as a container of Series (either rows or columns according the option considered at one moment of a reasoning) is a good way to consider DataFrames, even if it isn't strictly the case in reality. "Good" meaning that this vision enables to use DataFrames with efficiency. That's all.
.
Then what is a DataFrame object ?
The DataFrame class produces instances that have a particular structure originated in the NDFrame base class, itself derived from the PandasContainer base class that is also a parent class of the Series class.
Note that this is correct for Pandas until version 0.12. In the upcoming version 0.13, Series will derive also from NDFrame class only.
# with pandas 0.12
from pandas import Series
print 'Series :\n',Series
print 'Series.__bases__ :\n',Series.__bases__
from pandas import DataFrame
print '\nDataFrame :\n',DataFrame
print 'DataFrame.__bases__ :\n',DataFrame.__bases__
print '\n-------------------'
from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__ :\n',NDFrame.__bases__
from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__ :\n',PandasContainer.__bases__
from pandas.core.base import PandasObject
print '\nPandasObject.__bases__ :\n',PandasObject.__bases__
from pandas.core.base import StringMixin
print '\nStringMixin.__bases__ :\n',StringMixin.__bases__
result
Series :
<class 'pandas.core.series.Series'>
Series.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)
DataFrame :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__ :
(<class 'pandas.core.generic.NDFrame'>,)
-------------------
NDFrame.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>,)
PandasContainer.__bases__ :
(<class 'pandas.core.base.PandasObject'>,)
PandasObject.__bases__ :
(<class 'pandas.core.base.StringMixin'>,)
StringMixin.__bases__ :
(<type 'object'>,)
So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.
The ways these extracting methods work are described in this page:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
We find in it the method given by Dan Allan and other methods.
Why these extracting methods have been crafted as they were ?
That's certainly because they have been appraised as the ones giving the better possibilities and ease in data analysis.
It's precisely what is expressed in this sentence:
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data.
The why of the extraction of data from a DataFRame instance doesn't lies in its structure, it lies in the why of this structure. I guess that the structure and functionning of the Pandas' data structure have been chiseled in order to be as much intellectually intuitive as possible, and that to understand the details, one must read the blog of Wes McKinney.
If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead you should use syntax similar to this :
df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3]
isinstance(result, pd.DataFrame) # True
result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True
every time we put [['column name']] it returns Pandas DataFrame object,
if we put ['column name'] we got Pandas Series object
If you also select on the index of the dataframe then the result can be either a DataFrame or a Series or it can be a Series or a scalar (single value).
This function ensures that you always get a list from your selection (if the df, index and column are valid):
def get_list_from_df_column(df, index, column):
df_or_series = df.loc[index,[column]]
# df.loc[index,column] is also possible and returns a series or a scalar
if isinstance(df_or_series, pd.Series):
resulting_list = df_or_series.tolist() #get list from series
else:
resulting_list = df_or_series[column].tolist()
# use the column key to get a series from the dataframe
return(resulting_list)

Pandas selecting with unaligned indexes

I have 2 series.
The first one contains a list of numbers with an index counting 0..8.
A = pd.Series([2,3,4,6,5,4,7,6,5], name=['A'], index=[0,1,2,3,4,5,6,7,8])
The second one only contains True values, but the index of the series is a subset of the first one.
B = pd.Series([1, 1, 1, 1, 1], name=['B'], index=[0,2,4,7,8], dtype=bool)
I'd like to use B as boolean vector to get the A-values for the corresponding indexes, like:
A[B]
[...]
IndexingError: Unalignable boolean Series key provided
Unfortunately this raises an error.
Do I need to align them first?
Does
A[B.index.values]
work for your version of pandas? (I see we have different versions because now the Series name has to be hashable, so your code gave me an error)

Categories

Resources