Assignment through chained indexers - python

I would like to be able to assign to a DataFrame through chained indexers. Notionally like this:
subset = df.loc[mask]
... # much later
subset.loc[mask2, 'column'] += value
This does not work because, as I understand it, the second .loc triggers a copy-on-write. Is there a way to do this?
I could pass df and mask around so that the later code could combine mask and mask2 before making an assignment, but it feels much cleaner to be able to pass around the subset view instead, so that the later code only has to worry about its own mask.

When you get to:
subset.loc[mask2, 'column']
assign it to another variable so you can recover the labels it refers to. Note that selecting a single column label returns a Series, which carries its row labels in index and its column label in name.
subsubset = subset.loc[mask2, 'column']
Then you can address df with subsubset's labels:
df.loc[subsubset.index, subsubset.name] += 1
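Putting it together, a minimal sketch (the frame and masks here are stand-ins for the question's own objects):
import pandas as pd

df = pd.DataFrame({'column': range(6), 'other': [1, 1, 0, 1, 0, 1]})
mask = df['column'] > 1
subset = df.loc[mask]

# ... much later, in code that only knows about subset
mask2 = subset['other'] == 1
subsubset = subset.loc[mask2, 'column']  # a Series: carries .index and .name

# write back through the original frame using the recovered labels
df.loc[subsubset.index, subsubset.name] += 10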

Related

Assigning values to cross selection of MultiIndex DataFrame (ellipsis style of numpy)

In NumPy we can select along the last axis with ellipsis indexing, for instance array[..., 4].
In pandas I like to use a MultiIndex to structure large amounts of data (I think of the extra levels as additional dimensions of the DataFrame). If I want to select a given subset of a DataFrame df, in this case all columns 'key' in the last level of the columns MultiIndex, I can do it with the cross-selection method xs:
import numpy as np
import pandas as pd

# create sample MultiIndex dataframe
mi = pd.MultiIndex.from_product((('a', 'b', 'c'), (1, 2), ('some', 'key', 'foo')))
data = pd.DataFrame(data=np.random.rand(20, 18), columns=mi)
# make cross selection:
xs_df = data.xs('key', axis=1, level=-1)
But if I want to assign values to the cross selection, xs won't work.
The documentation proposes to use IndexSlice to access and set values to a cross selection:
idx = pd.IndexSlice
data.loc[:, idx[:, :, 'key']] *= 10
This works well, but only as long as I explicitly match the number of levels by inserting the correct number of : before 'key'.
If I just want to pass the number of levels to a selection function, or for instance always select the last level regardless of how many levels the DataFrame has, this won't work (as far as I know).
My current workaround is using None slices for n_levels to skip:
n_levels = data.columns.nlevels - 1 # assuming I want to select the last level
data.loc[:, (*(n_levels * [slice(None)]), 'key')] *= 100
This is imho a quite nasty and cumbersome workaround. Is there any more pythonic/nicer/better way?
In this case, you may be better off with get_level_values:
s = data.columns.get_level_values(-1) == 'key'  # boolean array over the columns
data.loc[:, s] *= 10
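Because get_level_values(-1) always addresses the last level, the same two lines work unchanged whatever the depth of the MultiIndex. A quick sketch, continuing from the imports above, with a hypothetical two-level variant of the sample columns:
# a two-level variant of the sample columns; the selection code is unchanged
mi2 = pd.MultiIndex.from_product((('a', 'b'), ('some', 'key', 'foo')))
data2 = pd.DataFrame(np.random.rand(5, 6), columns=mi2)

s2 = data2.columns.get_level_values(-1) == 'key'
data2.loc[:, s2] *= 10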
You can also use update together with xs, passing drop_level=False so that the result keeps the full column labels that update aligns on:
data.update(data.xs('key', level=-1, axis=1, drop_level=False) * 10)
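For what it's worth, update aligns on both row and column labels, which is why drop_level=False matters here: it keeps the full three-level column labels so the scaled values land back in the right columns. The same call, split into two steps:
scaled = data.xs('key', level=-1, axis=1, drop_level=False) * 10  # full labels kept
data.update(scaled)  # writes back in place, aligned on the matching labels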
I don't think there is a straightforward way to index and set values exactly the way you want. Adding to the previous answers, I'd suggest naming your columns, which makes them easier to wrangle with the query method (transposing turns the column levels into a queryable index):
# assign names
data.columns = data.columns.set_names(['first', 'second', 'third'])
# select the level of interest:
ind = data.T.query('third == "key"').index
# assign values
data.loc(axis=1)[ind] *= 10

Get a KeyError in Pandas

I am trying to call a function from a different module as below:
module1 - func1: returns a dataframe
module1 - func2(p_df_in_fromfunc1)
function 2:
for i in range(0, len(p_df_in_fromfunc1)):
    # Trying to retrieve row values of individual columns and assign to variables
    v_tmp = p_df_in_fromfunc1.loc[i, "Col1"]
When trying to run the above code, I get the error:
KeyError: 0
Could the issue be because I don't have a zero numbered row?
Without knowing much of your code, my guess is: for positional indexing, try using iloc instead of loc if you are interested in going position-wise.
Something like:
v_tmp = p_df_in_fromfunc1.iloc[i, p_df_in_fromfunc1.columns.get_loc("Col1")]
(iloc takes integer positions on both axes, so the column label has to be converted to its position.)
You may have missed closing the quote after Col1 in the loc call; it should read:
v_tmp = p_df_in_fromfunc1.loc[i, "Col1"]
For retrieving a row for specific columns do:
columns = ['Col1', 'Col2']
df[columns].iloc[index]
If you only want one column, you can simplify it to: df['Col1'].iloc[index]
As per your comment, you do not need to reset the index; you can iterate over the values of your index directly via df.index.
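A minimal sketch of that loop, using the frame's own labels so .loc always receives a key that exists:
# iterate over the actual index labels instead of range(len(...)),
# so .loc never sees a label that is missing from the index
for i in p_df_in_fromfunc1.index:
    v_tmp = p_df_in_fromfunc1.loc[i, "Col1"]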

Best practices for indexing with pandas

I want to select rows based on a mask, idx. I can think of two different possibilities, either using iloc or just using brackets. I have shown the two possibilities (on a dataframe df) below. Are they both equally viable?
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)
idx = idx.values
hr = df["hr"].iloc[idx]
timestamps = df["timestamp"].iloc[idx]
or the following one:
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)
hr = df["hr"][idx]
timestamps = df["timestamp"][idx]
No, they are not the same. One uses direct syntax while the other relies on chained indexing.
The crucial points are:
pd.DataFrame.iloc is used primarily for integer position-based indexing.
pd.DataFrame.loc is most often used with labels or Boolean arrays.
Chained indexing, i.e. via df[x][y], is explicitly discouraged and is never necessary.
idx.values returns the NumPy array representation of the idx Series. That array can feed .iloc (a Boolean Series cannot), but the conversion is unnecessary for .loc, which can take idx directly.
Below are two examples which would work. In either example, you can use similar syntax to mask a dataframe or series. For example, df['hr'].loc[mask] would work as well as df.loc[mask].
iloc
Here we use numpy.where to extract integer indices of True elements in a Boolean series. iloc does accept Boolean arrays but, in my opinion, this is less clear; "i" stands for integer.
idx = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
mask = np.where(idx)[0]
df = df.iloc[mask]
loc
Using loc is more natural when we are already querying by specific series.
mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
df = df.loc[mask]
When masking only rows, you can omit the loc accessor altogether and use df[mask].
If masking by rows and filtering for a column, you can use df.loc[mask, 'col_name'].
Indexing and Selecting Data is fundamental to pandas: there is no substitute for reading the official documentation.
Don't mix __getitem__ based indexing and (i)loc based. Use one or the other. I prefer (i)loc when you're accessing by index, and __getitem__ when you're accessing by column or using boolean indexing.
Here are some common bad methods of indexing:
df.loc[idx].loc[:, col]
df.loc[idx][col]
df[column][idx]
df[column].loc[idx]
The correct method for all the above would be df.loc[idx, col]. If idx is an integer position rather than a label, use df.loc[df.index[idx], col].
Most of the bad versions will cause issues down the pipeline (mainly in the form of a SettingWithCopyWarning) when you try to assign to the result, because they may create views tied to the original DataFrame.
The positional equivalent for all of them is df.iloc[idx, df.columns.get_loc(column)], where idx is an array of integer positions and column is a string label; the label-based form with loc is analogous.
If you have an array of booleans, use loc instead, like this: df.loc[boolean_idx, column]
Furthermore, these are fine: df[column] and df[boolean_mask].
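A small sketch of those correct patterns (the frame and names are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'hr': [60, 72, 80, 95], 'timestamp': [4, 7, 9, 11]})
col = 'hr'

boolean_idx = (df['timestamp'] >= 5).to_numpy()  # boolean mask over rows
print(df.loc[boolean_idx, col])                  # label-based column, mask for rows

idx = np.where(boolean_idx)[0]                   # integer row positions
print(df.iloc[idx, df.columns.get_loc(col)])     # all-positional equivalent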
There are rules for indexing a single row or single column. Depending on how it is done, you will get either a Series or DataFrame. So, if you want to index the 100th row from a DataFrame df as a DataFrame slice, you need to do:
df.iloc[[100], :] # `:` selects every column
And not
df.iloc[100, :]
And similarly for the column-based indexing.
Lastly, if you want to index a single scalar, use at or iat.
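For example, on a toy frame (the shapes are the point here):
import pandas as pd

df = pd.DataFrame({'a': range(200), 'b': range(200)})

df.iloc[100, :]             # Series: the row squeezed to one dimension
df.iloc[[100], :]           # DataFrame: a one-row slice
df.at[df.index[100], 'a']   # scalar, label-based
df.iat[100, 0]              # scalar, position-based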
OTOH, for your requirement, I would suggest a third alternative:
ts = df.loc[df.timestamp.between(5, 10), 'timestamp']
Or if you're subsetting the entire thing,
df = df[df.timestamp.between(5, 10)]

Pandas - chaining multiple .loc methods

In pandas, is there a way to combine both indexing by label and indexing by boolean mask in a single .loc call?
Currently I have this:
df.loc[start_date:end_date][[np.is_busday(x, holidays=dd.all_holidays) for x in df.index]]
Which works fine but I am curious if there is a better alternative. Thanks.
You can convert the index to a series and then use pd.Series.between and pd.Series.apply. The helper series gets a fresh integer index, so pass the combined mask as a plain array; a Boolean Series whose index does not match df's would fail to align inside .loc:
s = pd.Series(df.index)
mask = s.between(start_date, end_date) & s.apply(np.is_busday, holidays=dd.all_holidays)
df.loc[mask.to_numpy()]
Query may be more efficient as it will be vectorized, but it all depends on how much data you filtered out in the first place.
df.query(
    '(@start_date <= index < @end_date) & '
    '@np.is_busday(index, holidays=@dd.all_holidays)'
)
Side note: are you certain that your boolean mask works? df and the dataframe returned by loc that you are indexing with the mask might not have the same length anymore.

Proper way to utilize .loc in python's pandas

When trying to change a column of numbers from object to float dtypes using pandas dataframes, I receive the following warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Now, the code runs just fine, but what would be the proper and intended way to avoid this warning and still achieve the goal of:
df2[col] = df2[col].astype('float')
Let it be noted that df2 is a subset of df1 using a condition similar to:
df2 = df1[df1['some col'] == value]
Use the copy method. Instead of:
df2 = df1[df1['some col'] == value]
Just write:
df2 = df1[df1['some col'] == value].copy()
Initially, df2 is a slice of df1 and not an independent dataframe, which is why pandas emits the SettingWithCopyWarning when you try to modify it. Calling .copy() makes df2 its own DataFrame, so later assignments are unambiguous.
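A minimal sketch of the fix (the column names are made up for illustration):
import pandas as pd

df1 = pd.DataFrame({'some col': ['a', 'a', 'b'], 'num': ['1.5', '2.0', '3.1']})

df2 = df1[df1['some col'] == 'a'].copy()  # independent frame, not a slice
df2['num'] = df2['num'].astype('float')   # no SettingWithCopyWarning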
