Best practices for indexing with pandas

I want to select rows based on a mask, idx. I can think of two different possibilities, either using iloc or just using brackets. I have shown the two possibilities (on a dataframe df) below. Are they both equally viable?
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)
idx = idx.values
hr = df["hr"].iloc[idx]
timestamps = df["timestamp"].iloc[idx]
or the following one:
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)
hr = df["hr"][idx]
timestamps = df["timestamp"][idx]

No, they are not the same. The first converts the mask to a numpy array so it can feed iloc, while the second relies on chained bracket indexing.
The crucial points are:
pd.DataFrame.iloc is used primarily for integer position-based indexing.
pd.DataFrame.loc is most often used with labels or Boolean arrays.
Chained indexing, i.e. via df[x][y], is explicitly discouraged and is never necessary.
idx.values returns the numpy array representation of the idx series. A boolean Series cannot feed .iloc directly (hence the conversion), but no conversion is needed for .loc, which can take idx as-is.
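A minimal sketch of that last point (the column names here are made up for illustration):

import pandas as pd

df = pd.DataFrame({"timestamp": [3, 6, 8, 12], "hr": [60, 65, 70, 75]})
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)  # Boolean Series

hr = df["hr"].loc[idx]          # works: loc accepts the Boolean Series directly
hr = df["hr"].iloc[idx.values]  # works: iloc accepts the underlying numpy array
# df["hr"].iloc[idx]            # raises ValueError: iloc rejects a Boolean Series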
Below are two examples which would work. In either example, you can use similar syntax to mask a dataframe or series. For example, df['hr'].loc[mask] would work as well as df.loc[mask].
iloc
Here we use numpy.where to extract integer indices of True elements in a Boolean series. iloc does accept Boolean arrays but, in my opinion, this is less clear; "i" stands for integer.
import numpy as np

idx = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
mask = np.where(idx)[0]  # integer positions of True elements
df = df.iloc[mask]
loc
Using loc is more natural when we are already querying by specific series.
mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
df = df.loc[mask]
When masking only rows, you can omit the loc accessor altogether and use df[mask].
If masking by rows and filtering for a column, you can use df.loc[mask, 'col_name']
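For instance, with the hypothetical timestamp/hr frame from above:

mask = df['timestamp'].between(5, 10)

df[mask]              # rows only: the loc accessor can be omitted
df.loc[mask, 'hr']    # rows by mask, one column by label -> Series
df.loc[mask, ['hr']]  # same rows, list of labels -> DataFrame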
Indexing and Selecting Data is fundamental to pandas: there is no substitute for reading the official documentation.

Don't mix __getitem__-based indexing and (i)loc-based indexing. Use one or the other. I prefer (i)loc when you're accessing by index, and __getitem__ when you're accessing by column or using boolean indexing.
Here are some commonly seen bad methods of indexing:
df.loc[idx].loc[:, col]
df.loc[idx][col]
df[column][idx]
df[column].loc[idx]
The correct method for all the above would be df.loc[idx, col]. If idx is an integer label, use df.loc[df.index[idx], col].
Most of these will cause issues down the pipeline (typically a SettingWithCopyWarning) when you try assigning to them, because pandas cannot guarantee whether the chained intermediate is a view into the original DataFrame or a copy of it.
If idx is instead an array of integer positions, the single-step equivalent is df.iloc[idx, df.columns.get_loc(column)], where column is a string label (the loc analogue was shown above).
If you have an array of booleans, use loc instead, like this: df.loc[boolean_idx, column]
Furthermore, these are fine on their own: df[column] and df[boolean_mask].
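To see why the chained forms bite when assigning, here is a sketch with made-up column names:

import pandas as pd

df = pd.DataFrame({'col': range(5), 'other': range(5)})
idx = df['col'] > 2

# Chained: the assignment may land on a temporary copy,
# triggering SettingWithCopyWarning and leaving df unchanged.
df[idx]['col'] = 0

# Single-step loc: unambiguous, modifies df in place.
df.loc[idx, 'col'] = 0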
There are rules for indexing a single row or single column. Depending on how it is done, you will get either a Series or DataFrame. So, if you want to index the 100th row from a DataFrame df as a DataFrame slice, you need to do:
df.iloc[[100], :] # `:` selects every column
And not
df.iloc[100, :]
And similarly for the column-based indexing.
Lastly, if you want to index a single scalar, use at or iat.
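A quick illustration of both points, assuming df has at least 101 rows and a column labelled 'col':

row_df = df.iloc[[100], :]   # DataFrame with a single row (note the list)
row_sr = df.iloc[100, :]     # Series holding that row's values

df.iat[100, 0]               # scalar access by position
df.at[df.index[100], 'col']  # scalar access by label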
OTOH, for your requirement, I would suggest a third alternative:
ts = df.loc[df.timestamp.between(5, 10), 'timestamp']
Or if you're subsetting the entire thing,
df = df[df.timestamp.between(5, 10)]

Related

What advantages does the iloc function have in pandas and Python

I just began to learn Python and Pandas and I saw in many tutorials the use of the iloc function. It is always stated that you can use this function to refer to columns and rows in a dataframe. However, you can also do this directly without the iloc function. So here is an example that yields the same output:
# features is just a dataframe with several rows and columns
features = pd.DataFrame(features_standardized)
y_train = features.iloc[start:end][[1]]
y_train_noIloc = features[start:end][[1]]
What is the difference between the two statements and what advantage do I have when using iloc? I'd appreciate every comment.
Per the pandas docs, iloc provides:
Purely integer-location based indexing for selection by position.
Therefore, as shown in the simplistic examples below, [row, col] indexing is not possible without using loc or iloc, as a KeyError will be thrown.
Example:
# Build a simple, sample DataFrame.
df = pd.DataFrame({'a': [1, 2, 3, 4]})
# No iloc
>>> df[0, 0]
KeyError: (0, 0)
# With iloc:
>>> df.iloc[0, 0]
1
The same logic holds true when using loc and a column name.
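For example, with the same frame:

>>> df.loc[0, 'a']
1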
What is the difference and when does the indexing work without iloc?
The short answer:
Use loc and/or iloc when indexing by both rows and columns. If indexing on rows or columns alone, you can get away without it; this is referred to as 'slicing'.
However, I see in your example that [start:end][[1]] has been used. It is generally considered bad practice to have back-to-back square brackets in pandas (e.g. [][]), and usually an indication that a different (more efficient) approach should be taken - in this case, using iloc.
The longer answer:
Adapting your [start:end] slicing example (shown below), indexing works without iloc when indexing (slicing) on rows only. The following example does not use iloc and returns the first three rows, at positions 0 through 2 (a positional slice excludes its endpoint).
df[0:3]
Output:
a
0 1
1 2
2 3
Note the difference between [0:3] and [0, 3]. The former (slicing) uses a colon and returns the rows at positions 0 through 2. The latter uses a comma and is a [row, col] indexer, which requires the use of iloc.
Aside:
The two methods can be combined as shown here, returning the first three rows of the column at index 0. This is not possible without the use of iloc.
df.iloc[0:3, 0]

Understanding bracket filter syntax in pandas

How does the following filter out the results in pandas? For example, with this statement:
df[['name', 'id', 'group']][df.id.notnull()]
I get 426 rows (it keeps only the rows where df.id IS NOT NULL). However, if I just use that syntax by itself, it returns a bool for each row, {index: bool}:
[df.group.notnull()]
How does the bracket notation work with pandas? Another example would be:
df.id[df.id==458514] # filters out rows
# vs
[df.id==458514] # returns a bool
Not a full answer, just a breakdown of df.id[df.id==458514]
df.id returns a series with the contents of column id
df.id[...] slices that series with either 1) a boolean mask, 2) a single index label or a list of them, or 3) a slice of labels in the form start:end:step. If it receives a boolean mask, the mask must be the same shape as the series being sliced. If it receives index label(s), it returns those specific rows. Slicing works much as with python lists, except that start and end can be integer locations or index labels, and label slices include the endpoint (e.g. ['a':'e'] returns all rows in between, including 'e').
df.id[df.id==458514] returns a filtered series with your boolean mask, i.e. only the items where df.id equals 458514. It also works with other boolean masks as in df.id[df.name == 'Carl'] or df.id[df.name.isin(['Tom', 'Jerry'])].
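A small sketch of those three modes, on a made-up Series:

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

s[s > 15]      # 1) boolean mask of the same shape
s[['a', 'c']]  # 2) a list of index labels
s['a':'b']     # 3) a label slice, inclusive of 'b'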
Read more in pandas' intro to data structures

Assigning values to cross selection of MultiIndex DataFrame (ellipsis style of numpy)

In numpy we can select the last axis with ellipsis indexing, f.i. array[..., 4].
In Pandas DataFrames for structuring large amounts of data, I like to use MultiIndex (which I see as some kind of additional dimensions of the DataFrame). If I want to select a given subset of a DataFrame df, in this case all columns 'key' in the last level of the columns MultiIndex, I can do it with the cross selection method xs:
import numpy as np
import pandas as pd

# create sample multiindex dataframe
mi = pd.MultiIndex.from_product((('a', 'b', 'c'), (1, 2), ('some', 'key', 'foo')))
data = pd.DataFrame(data=np.random.rand(20, 18), columns=mi)
# make cross selection:
xs_df = data.xs('key', axis=1, level=-1)
But if I want to assign values to the cross selection, xs won't work.
The documentation proposes to use IndexSlice to access and set values to a cross selection:
idx = pd.IndexSlice
data.loc[:, idx[:, :, 'key']] *= 10
Which is working well as long as I explicitly enter the number of levels by inserting the correct amount of : before 'key'.
Assuming I just want to pass the number of levels to a selection function, or f.i. always select the last level regardless of how many levels the DataFrame has, this won't work (afaik).
My current workaround is using None slices for n_levels to skip:
n_levels = data.columns.nlevels - 1 # assuming I want to select the last level
data.loc[:, (*n_levels*[slice(None)], 'key')] *= 100
This is imho a quite nasty and cumbersome workaround. Is there any more pythonic/nicer/better way?
In this case, you may be better off with get_level_values:
s = data.columns.get_level_values(-1) == 'key'
data.loc[:,s] *= 10
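Because the comparison produces a flat boolean array over the columns, this works for any number of levels; as a quick check:

s = data.columns.get_level_values(-1) == 'key'
data.columns[s]  # only the columns whose last level is 'key'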
I feel like we can use update and pass drop_level=False to xs:
data.update(data.xs('key', level=-1, axis=1, drop_level=False) * 10)
I don't think there is as straightforward a way to index and set values the way you want. Adding to the previous answers, I'd suggest naming your columns, which makes them easier to wrangle with the query method:
# assign names
data.columns = data.columns.set_names(['first', 'second', 'third'])
# select the level of interest
ind = data.T.query('third == "key"').index
# assign values
data.loc(axis=1)[ind] *= 10

Pandas - chaining multiple .loc methods

In pandas, is there a way to combine both indexing by label and indexing by boolean mask in a single .loc call?
Currently I have this:
df.loc[start_date:end_date][[np.is_busday(x, holidays=dd.all_holidays) for x in df.index]]
Which works fine but I am curious if there is a better alternative. Thanks.
You can convert the index to a series and then use pd.Series.between and pd.Series.apply:
s = pd.Series(df.index)
mask = s.between(start_date, end_date) & s.apply(np.is_busday, holidays=dd.all_holidays)
df.loc[mask.values]  # .values is needed: the mask's RangeIndex does not align with df's index
Query may be more efficient as it will be vectorized, but it all depends on how much data you filtered out in the first place.
df.query(
    '(@start_date <= index < @end_date) & '
    '@np.is_busday(index, holidays=@dd.all_holidays)'
)
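If df.index is a DatetimeIndex, another option (a sketch using the question's start_date, end_date, and dd.all_holidays) is to build one vectorized mask and pass it to a single .loc call:

import numpy as np

days = df.index.values.astype('datetime64[D]')  # np.is_busday needs day precision
mask = (
    (df.index >= start_date)
    & (df.index <= end_date)
    & np.is_busday(days, holidays=dd.all_holidays)
)
df.loc[mask]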
Side note: are you certain that your boolean mask works? df and the dataframe returned by loc that you are indexing with the mask might not have the same length anymore.

Comparing 2 datetime64[ns] dataframe columns

I have two date columns namely date1 and date2.
I am trying to select rows which have date1 later than date2
I tried to
print df[df.loc[df['date1']>df['date2']]]
but I received an error
ValueError: Boolean array expected for the condition, not float64
Either way, the idea is to obtain a boolean mask, which is then used to index into the dataframe and retrieve the corresponding rows. First, generate the mask:
mask = df['date1'] > df['date2']
Now, use this mask to index df:
df = df.loc[mask]
This can be written in a single line.
df = df.loc[df['date1'] > df['date2']]
You do not need to perform another level of indexing after this; df now holds your final result. I recommend loc if you plan to perform operations and reassignment on the filtered dataframe, because an assignment such as df.loc[mask, 'col'] = value operates on the original frame in a single step, whereas chained indexing may assign to a temporary copy and raise a SettingWithCopyWarning.
Below are some more methods of doing the same thing:
Option 1
df.query
df.query('date1 > date2')
Option 2
df.eval
df[df.eval('date1 > date2')]
If your columns are not dates, you might as well cast them now. Use pd.to_datetime:
df.date1 = pd.to_datetime(df.date1)
df.date2 = pd.to_datetime(df.date2)
Or, when loading your CSV, make sure to set the parse_dates switch on:
df = pd.read_csv(..., parse_dates=['date1', 'date2'])
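Putting it together, a minimal end-to-end sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'date1': ['2021-01-05', '2021-01-02'],
                   'date2': ['2021-01-01', '2021-01-10']})
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

print(df.loc[df['date1'] > df['date2']])  # only the first row qualifies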
