In pandas, is there a way to combine both indexing by label and indexing by boolean mask in a single .loc call?
Currently I have this:
df.loc[start_date:end_date][[np.is_busday(x, holidays=dd.all_holidays) for x in df.index]]
Which works fine but I am curious if there is a better alternative. Thanks.
You can convert the index to a series and then use pd.Series.between and pd.Series.apply:
s = pd.Series(df.index)
mask = s.between(start_date, end_date) & \
       s.apply(np.is_busday, holidays=dd.all_holidays)
df.loc[mask.to_numpy()]  # s has a fresh RangeIndex, so pass the raw array to avoid alignment errors
df.query may be more efficient here, since np.is_busday is applied to the whole index at once rather than element by element via apply, but it all depends on how much data you filter out in the first place.
df.query(
'(@start_date <= index < @end_date) & '
'@np.is_busday(index, holidays=@dd.all_holidays)'
)
Side note: are you certain that your boolean mask works? df and the dataframe returned by loc (which you are indexing with the mask) might no longer have the same length.
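For completeness, a minimal sketch of a true single-.loc version (assuming, as in your snippet, that df has a DatetimeIndex and dd.all_holidays is a list of holiday dates): build both conditions against df.index itself, so the combined mask always matches df in length.
import numpy as np

in_range = (df.index >= start_date) & (df.index <= end_date)
busday = np.is_busday(df.index.values.astype('datetime64[D]'),
                      holidays=dd.all_holidays)
df.loc[in_range & busday]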
I have a df as below
I want to make this df binary as follows
I tried
df[:]=np.where(df>0, 1, 0)
but with this I am losing my df index.
I could apply this to each column one by one or use a loop, but I think there should be an easy and quick way to do this.
You can convert the boolean mask created by DataFrame.gt to integers:
df1 = df.gt(0).astype(int)
Or use DataFrame.clip if the values are integers and there are no negative values:
df1 = df.clip(upper=1)
Your solution should work with loc:
df.loc[:] = np.where(df>0, 1, 0)
Of course it is possible with a function, but it can also be done with just an operator:
(df > 0) * 1
Without using numpy (assuming any remaining values are already 0):
df[df > 0] = 1
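A small self-contained demo on made-up data, showing that gt keeps the index intact, and why clip needs the no-negatives caveat:
import pandas as pd

df = pd.DataFrame({'a': [0, 2, -1], 'b': [5, 0, 3]}, index=['x', 'y', 'z'])

print(df.gt(0).astype(int))   # index 'x', 'y', 'z' is preserved
#    a  b
# x  0  1
# y  1  0
# z  0  1
print(df.clip(upper=1))       # the -1 stays -1, so clip only fits non-negative data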
I want to select rows based on a mask, idx. I can think of two different possibilities, either using iloc or just using brackets. I have shown the two possibilities (on a dataframe df) below. Are they both equally viable?
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)
idx = idx.values
hr = df["hr"].iloc[idx]
timestamps = df["timestamp"].iloc[idx]
or the following one:
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)
hr = df["hr"][idx]
timestamps = df["timestamp"][idx]
No, they are not the same. One uses direct syntax while the other relies on chained indexing.
The crucial points are:
pd.DataFrame.iloc is used primarily for integer position-based indexing.
pd.DataFrame.loc is most often used with labels or Boolean arrays.
Chained indexing, i.e. via df[x][y], is explicitly discouraged and is never necessary.
idx.values returns the numpy array representation of the idx series. A raw Boolean array can feed .iloc (a Boolean Series cannot), but it is not necessary for .loc, which can take idx directly.
Below are two examples which would work. In either example, you can use similar syntax to mask a dataframe or series. For example, df['hr'].loc[mask] would work as well as df.loc[mask].
iloc
Here we use numpy.where to extract integer indices of True elements in a Boolean series. iloc does accept Boolean arrays but, in my opinion, this is less clear; "i" stands for integer.
idx = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
mask = np.where(idx)[0]
df = df.iloc[mask]
loc
Using loc is more natural when the mask is already built from specific series.
mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
df = df.loc[mask]
When masking only rows, you can omit the loc accessor altogether and use df[mask].
If masking by rows and filtering for a column, you can use df.loc[mask, 'col_name']
Indexing and Selecting Data is fundamental to pandas: there is no substitute for reading the official documentation.
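For reference, a tiny runnable example of both accessors on made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': [3, 5, 7, 10, 12],
                   'hr': [60, 62, 65, 70, 72]})

mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)

df.loc[mask, 'hr']            # label/Boolean based: rows where mask is True
df.iloc[np.where(mask)[0]]    # position based: integer indices of True elements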
Don't mix __getitem__ based indexing and (i)loc based. Use one or the other. I prefer (i)loc when you're accessing by index, and __getitem__ when you're accessing by column or using boolean indexing.
Here are some common bad methods of indexing:
df.loc[idx].loc[:, col]
df.loc[idx][col]
df[column][idx]
df[column].loc[idx]
The correct method for all of the above is df.loc[idx, col]. If idx is an integer position rather than a label, use df.loc[df.index[idx], col].
Most of these solutions will cause issues down the pipeline when you try assigning to them (mainly in the form of a SettingWithCopyWarning), because they may create views tied to the original DataFrame they're viewing into.
The correct solution to all these versions is df.iloc[idx, df.columns.get_loc(column)]. Note that idx here is an array of integer positions and column is a string label. Similarly for loc.
If you have an array of booleans, use loc instead, like this: df.loc[boolean_idx, column]
Furthermore, these are fine: df[column], and df[boolean_mask]
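A short sketch of the recommended forms on made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# integer positions for rows, get_loc to translate the column label
idx = np.array([0, 2])
df.iloc[idx, df.columns.get_loc('b')]

# a Boolean mask goes through loc, and assignment works in the same call
boolean_idx = df['a'] > 1
df.loc[boolean_idx, 'b'] = 0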
There are rules for indexing a single row or single column. Depending on how it is done, you will get either a Series or DataFrame. So, if you want to index the 100th row from a DataFrame df as a DataFrame slice, you need to do:
df.iloc[[100], :] # `:` selects every column
And not
df.iloc[100, :]
And similarly for the column-based indexing.
Lastly, if you want to index a single scalar, use at or iat.
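For example (assuming df has at least 101 rows and an 'hr' column, as in the question):
df.iloc[[100], :]           # DataFrame with one row
df.iloc[100, :]             # Series
df.iat[100, 0]              # scalar at row position 100, column position 0
df.at[df.index[100], 'hr']  # scalar by row label and column label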
OTOH, for your requirement, I would suggest a third alternative:
ts = df.loc[df.timestamp.between(5, 10), 'timestamp']
Or if you're subsetting the entire thing,
df = df[df.timestamp.between(5, 10)]
I have two date columns namely date1 and date2.
I am trying to select rows which have date1 later than date2
I tried to
print df[df.loc[df['date1']>df['date2']]]
but I received an error:
ValueError: Boolean array expected for the condition, not float64
In either case, the idea is to retrieve a boolean mask. This boolean mask will then be used to index into the dataframe and retrieve corresponding rows. First, generate a mask:
mask = df['date1'] > df['date2']
Now, use this mask to index df:
df = df.loc[mask]
This can be written in a single line.
df = df.loc[df['date1'] > df['date2']]
You do not need to perform another level of indexing after this; df now holds your final result. I recommend loc if you plan to perform reassignment on the filtered data, because a single call such as df.loc[mask, 'col'] = value writes through to the original frame, whereas chained plain indexing (df[mask]['col'] = value) assigns to a temporary copy and triggers a SettingWithCopyWarning.
Below are some more methods of doing the same thing:
Option 1
df.query
df.query('date1 > date2')
Option 2
df.eval
df[df.eval('date1 > date2')]
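A quick self-contained check of all three forms on made-up data:
import pandas as pd

df = pd.DataFrame({'date1': pd.to_datetime(['2020-01-05', '2020-03-01']),
                   'date2': pd.to_datetime(['2020-02-01', '2020-01-15'])})

df.loc[df['date1'] > df['date2']]   # keeps only the second row
df.query('date1 > date2')           # same result
df[df.eval('date1 > date2')]        # same result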
If your columns are not dates, you might as well cast them now. Use pd.to_datetime:
df.date1 = pd.to_datetime(df.date1)
df.date2 = pd.to_datetime(df.date2)
Or, when loading your CSV, make sure to set the parse_dates switch on:
df = pd.read_csv(..., parse_dates=['date1', 'date2'])
I would like to be able to assign to a DataFrame through chained indexers. Notionally like this:
subset = df.loc[mask]
... # much later
subset.loc[mask2, 'column'] += value
This does not work because, as I understand it, the second .loc triggers a copy-on-write. Is there a way to do this?
I could pass df and mask around so that the later code could combine mask and mask2 before making an assignment, but it feels much cleaner to pass around the subset view instead, so that the later code only has to worry about its own mask.
When you get to:
subset.loc[mask2, 'column']
assign this to another variable, selecting the column with a list so the result stays a DataFrame and you can access its index and columns attributes:
subsubset = subset.loc[mask2, ['column']]
Then you can access df with subsubset's index and columns
df.loc[subsubset.index, subsubset.columns] += 1
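Putting it together, a minimal round trip on made-up data:
import pandas as pd

df = pd.DataFrame({'column': [1, 2, 3, 4]})
mask = df['column'] > 1
subset = df.loc[mask]

mask2 = subset['column'] > 2
subsubset = subset.loc[mask2, ['column']]   # the list keeps a DataFrame

# write back through the original frame via the surviving labels
df.loc[subsubset.index, subsubset.columns] += 1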
I am working on a large dataset and there are a few duplicates in my index. I'd like to (perhaps visually) check what these duplicated rows are like and then decide which one to drop. Is there a way that I can select the slice of the dataframe that have duplicated indices (or duplicates in any columns)?
Any help is appreciated.
You can use DataFrame.duplicated and then slice the frame with the resulting boolean mask. For more information on any method or advanced features, I would advise you to always check its docstring.
Well, this would solve the case for you:
df[df.duplicated('Column Name', keep=False)]
Here,
keep=False will return all those rows having duplicate values in that column.
use the duplicated method of DataFrame (in modern pandas the parameter is subset rather than the old cols):
df.duplicated(subset=[...])
See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
EDIT
You can use:
df[df.duplicated(subset=[...]) | df.duplicated(subset=[...], keep='last')]
or, you can use groupby and filter:
df.groupby([...]).filter(lambda g: g.shape[0] > 1)
or apply:
df.groupby([...], group_keys=False).apply(lambda g: g if g.shape[0] > 1 else None)
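A small demo on made-up data; note that duplicated(keep=False) gives the same rows more directly:
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'a', 'c'], 'val': [1, 2, 3, 4]})

df.groupby('name').filter(lambda g: g.shape[0] > 1)
# equivalent:
df[df.duplicated('name', keep=False)]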