How does the following filter out the results in pandas? For example, with this statement:
df[['name', 'id', 'group']][df.id.notnull()]
I get 426 rows (it keeps only the rows where df.id is not null, i.e. it filters out every row where df.id IS NULL). However, if I just use the inner condition by itself, it returns a boolean for each row, {index: bool}:
df.id.notnull()
How does the bracket notation work with pandas? Another example would be:
df.id[df.id==458514] # filters out rows
# vs
df.id==458514 # returns a boolean Series
Not a full answer, just a breakdown of df.id[df.id==458514]
df.id returns a Series with the contents of column id.
df.id[...] slices that Series with either 1) a boolean mask, 2) a single index label or a list of labels, or 3) a slice of labels in the form start:end:step. If it receives a boolean mask, the mask must be the same length as the Series being sliced. If it receives index label(s), it returns those specific rows. Slicing works much as with Python lists, except that start and end can be integer locations or index labels, and label slices are end-inclusive (e.g. df.id['a':'e'] returns all rows from 'a' through 'e', including 'e').
df.id[df.id==458514] returns a filtered series with your boolean mask, i.e. only the items where df.id equals 458514. It also works with other boolean masks as in df.id[df.name == 'Carl'] or df.id[df.name.isin(['Tom', 'Jerry'])].
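For instance, a quick illustration of the three modes on a made-up series:
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
s[s > 25]      # boolean mask: keeps 'c', 'd', 'e'
s[['a', 'd']]  # list of labels: keeps 'a' and 'd'
s['b':'d']     # label slice: keeps 'b' through 'd', end-inclusive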
Read more in pandas' intro to data structures.
I'm experimenting with pandas and noticed something that seemed odd.
If you have a boolean series assigned to a variable, you can then subset that object by index numbers, e.g.:
Starting from a DataFrame 'ah', I created a boolean series 'tah' via:
tah = ah['_merge']=='left_only'
This boolean series could be index-subset like this:
tah[0:1]
yielding the first element of the series.
Yet if I tried to do this all in one line
ah['_merge']=='left_only'[0:1]
I get an unexpected output: the boolean series is neither sliced nor seems to correspond to the subsetted column.
I've been experimenting and can't seem to determine what, in the all-in-one-line version, [0:1] is slicing/filtering for. Any clarification would be appreciated!
Because of operator precedence, the slice [0:1] is applied to the string 'left_only' first, yielding its first character, 'l'. You are then comparing every row to 'l', and since 'l' is equal to neither 'left_only' nor 'both', it prints a whole column of False booleans.
You can use (ah['_merge']=='left_only')[0:1] as MattDMo mentioned in the comments.
Or you can use pandas.Series.iloc with a slice object to select the elements you need based on their position in your dataframe.
(ah['_merge']=='left_only').iloc[0:1]
Both commands will return True, since the first row of your dataframe has a 'left_only' type of merge.
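To make the precedence visible, a minimal sketch with a made-up ah:
import pandas as pd

ah = pd.DataFrame({'_merge': ['left_only', 'both', 'right_only']})
'left_only'[0:1]                    # -> 'l': the slice binds to the string first
ah['_merge'] == 'left_only'[0:1]    # compares every row to 'l': all False
(ah['_merge'] == 'left_only')[0:1]  # parentheses force the comparison first: row 0 is True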
I have a MultiIndex dataframe.
I am able to create a logical mask using the following:
df.index.get_level_values(0).to_series().str.find('1000')!=-1
This returns True for all rows where the first index level contains the characters '1000', and False otherwise.
But I am not able to slice the dataframe using that mask.
I tried with:
df[df.index.get_level_values(0).to_series().str.find('1000')!=-1]
and it returned the following error:
ValueError: cannot reindex from a duplicate axis
I also tried with:
df[df.index.get_level_values(0).to_series().str.find('1000')!=-1,:]
which treats the whole thing as a tuple key and raises the following error (the message echoes the logical mask):
Length: 1755, dtype: bool, slice(None, None, None))' is an invalid key
Can someone point me to the right solution, and to a good reference on how to properly slice a MultiIndex dataframe?
One idea is to remove to_series() and use str.contains to test for the substring:
df[df.index.get_level_values(0).str.contains('1000')]
Another is to convert the mask to a numpy array. A boolean Series is aligned on its index before filtering, and here that index contains duplicates, which is what triggers the reindex error; a plain array bypasses the alignment:
df[df.index.get_level_values(0).str.contains('1000').values]
Your solution also works once the mask's values are converted to an array:
df[(df.index.get_level_values(0).to_series().str.find('1000')!=-1).values]
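For instance, on a small made-up MultiIndex frame:
import pandas as pd

idx = pd.MultiIndex.from_tuples([('1000A', 1), ('2000B', 1), ('1000C', 2)],
                                names=['code', 'n'])
df = pd.DataFrame({'val': [10, 20, 30]}, index=idx)
df[df.index.get_level_values(0).str.contains('1000')]  # keeps '1000A' and '1000C'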
I want to select rows based on a mask, idx. I can think of two different possibilities, either using iloc or just using brackets. I have shown the two possibilities (on a dataframe df) below. Are they both equally viable?
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)
idx = idx.values
hr = df["hr"].iloc[idx]
timestamps = df["timestamp"].iloc[idx]
or the following one:
idx = (df["timestamp"] >= 5) & (df["timestamp"] <= 10)
hr = df["hr"][idx]
timestamps = df["timestamp"][idx]
No, they are not the same. One uses direct syntax while the other relies on chained indexing.
The crucial points are:
pd.DataFrame.iloc is used primarily for integer position-based indexing.
pd.DataFrame.loc is most often used with labels or Boolean arrays.
Chained indexing, i.e. via df[x][y], is explicitly discouraged and is never necessary.
idx.values returns the numpy array representation of the idx Series. A boolean Series cannot feed .iloc directly (a boolean array can), but neither form is necessary for .loc, which can take idx as-is.
Below are two examples which would work. In either example, you can use similar syntax to mask a dataframe or series. For example, df['hr'].loc[mask] would work as well as df.loc[mask].
iloc
Here we use numpy.where to extract integer indices of True elements in a Boolean series. iloc does accept Boolean arrays but, in my opinion, this is less clear; "i" stands for integer.
import numpy as np

idx = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
mask = np.where(idx)[0]  # integer positions of the True rows
df = df.iloc[mask]
loc
Using loc is more natural when we are already querying by specific series.
mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
df = df.loc[mask]
When masking only rows, you can omit the loc accessor altogether and use df[mask].
If masking by rows and filtering for a column, you can use df.loc[mask, 'col_name'].
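For instance, on a made-up df:
import pandas as pd

df = pd.DataFrame({'timestamp': [3, 6, 8, 12], 'hr': [60, 65, 70, 75]})
mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
df[mask]            # mask rows only
df.loc[mask, 'hr']  # mask rows and select a single column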
Indexing and Selecting Data is fundamental to pandas: there is no substitute for reading the official documentation.
Don't mix __getitem__ based indexing and (i)loc based. Use one or the other. I prefer (i)loc when you're accessing by index, and __getitem__ when you're accessing by column or using boolean indexing.
Here are some commonly seen bad methods of indexing:
df.loc[idx].loc[:, col]
df.loc[idx][col]
df[column][idx]
df[column].loc[idx]
The correct method for all the above would be df.loc[idx, col]. If idx is an integer label, use df.loc[df.index[idx], col].
Most of these solutions will cause issues down the pipeline (mainly in the form of a SettingWithCopyWarning) when you try assigning to them, because they create views tied to the original DataFrame they're viewing into.
The correct solution to all these versions is df.iloc[idx, df.columns.get_loc(column)]. Note that idx is an array of integer indexes and column is a string label. Similarly for loc.
If you have an array of booleans, use loc instead, like this: df.loc[boolean_idx, column]
Furthermore, these are fine: df[column] and df[boolean_mask].
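A small sketch of those safe patterns, on a made-up frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': range(5, 10)})
idx = np.array([0, 2])                      # integer positions
df.iloc[idx, df.columns.get_loc('A')] = 99  # positional assignment, no chaining
mask = df['B'] > 7                          # boolean mask
df.loc[mask, 'A'] = -1                      # label/boolean assignment in one call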
There are rules for indexing a single row or single column. Depending on how it is done, you will get either a Series or DataFrame. So, if you want to index the 100th row from a DataFrame df as a DataFrame slice, you need to do:
df.iloc[[100], :] # `:` selects every column
And not
df.iloc[100, :]
And similarly for the column-based indexing.
Lastly, if you want to index a single scalar, use at or iat.
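For example, on a made-up df:
import pandas as pd

df = pd.DataFrame({'A': range(200)})
df.at[100, 'A']  # scalar lookup by label
df.iat[100, 0]   # scalar lookup by integer position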
OTOH, for your requirement, I would suggest a third alternative:
ts = df.loc[df.timestamp.between(5, 10), 'timestamp']
Or if you're subsetting the entire thing,
df = df[df.timestamp.between(5, 10)]
I have 2 series.
The first one contains numbers, with an index running from 0 to 8.
A = pd.Series([2,3,4,6,5,4,7,6,5], name=['A'], index=[0,1,2,3,4,5,6,7,8])
The second one only contains True values, but the index of the series is a subset of the first one.
B = pd.Series([1, 1, 1, 1, 1], name=['B'], index=[0,2,4,7,8], dtype=bool)
I'd like to use B as boolean vector to get the A-values for the corresponding indexes, like:
A[B]
[...]
IndexingError: Unalignable boolean Series key provided
Unfortunately this raises an error.
Do I need to align them first?
Does
A[B.index.values]
work for your version of pandas? (I see we have different versions, because for me the Series name now has to be hashable, so your code gave me an error.)
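If not, one way to align them first, sketched on your data (dropping the list-valued name, which newer pandas rejects):
import pandas as pd

A = pd.Series([2, 3, 4, 6, 5, 4, 7, 6, 5], index=range(9))
B = pd.Series([True] * 5, index=[0, 2, 4, 7, 8])
A[B.index]                               # select by B's labels directly
A[B.reindex(A.index, fill_value=False)]  # or align B to A's index, then mask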
city state neighborhoods categories
Dravosburg PA [asas,dfd] ['Nightlife']
Dravosburg PA [adad] ['Auto_Repair','Automotive']
I have the above dataframe, and I want to convert each element of the lists into its own column, e.g.:
city state asas dfd adad Nightlife Auto_Repair Automotive
Dravosburg PA 1 1 0 1 0 0
Dravosburg PA 0 0 1 0 1 1
I am using the following code to do this:
def list2columns(df):
    """
    to convert list in the columns
    of a dataframe
    """
    columns = ['categories', 'neighborhoods']
    for col in columns:
        for i in range(len(df)):
            for element in eval(df.loc[i, col]):
                if len(element) != 0:
                    if element not in df.columns:
                        df.loc[:, element] = 0
                    df.loc[i, element] = 1
How can I do this in a more efficient way?
Also, why is there still the warning below when I am already using df.loc?
SettingWithCopyWarning: A value is trying to be set on a copy of a slice
from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
Since you're using eval(), I assume each column holds a string representation of a list, rather than a list itself. Also, unlike your example above, I'm assuming there are quotes around the items in the lists in your neighborhoods column (df['neighborhoods'].iloc[0] == "['asas','dfd']"), because otherwise your eval() would fail.
If this is all correct, you could try something like this:
def list2columns(df):
    """
    to convert list in the columns of a dataframe
    """
    columns = ['categories', 'neighborhoods']
    new_cols = set()  # all new columns added
    for col in columns:
        for i in range(len(df[col])):
            # get the list of columns to set
            set_cols = eval(df[col].iloc[i])
            # set the values of these columns to 1 in the current row
            # (if this causes new columns to be added, other rows will get NaNs)
            for element in set_cols:
                df.loc[df.index[i], element] = 1
            # remember which new columns have been added
            new_cols.update(set_cols)
    # convert any un-set values in the new columns to 0
    df[list(new_cols)] = df[list(new_cols)].fillna(0)
I can only speculate on an answer to your second question, about the SettingWithCopy warning.
It's possible (but unlikely) that using df.iloc instead of df.loc will help, since that is intended to select by row number (in your case, df.loc[i, col] only works because you haven't set an index, so pandas uses the default index, which matches the row number).
Another possibility is that the df that is passed in to your function is already a slice from a larger dataframe, and that is causing the SettingWithCopy warning.
I've also found that using df.loc with mixed indexing modes (logical selectors for rows and column names for columns) produces the SettingWithCopy warning; it's possible that your slice selectors are causing similar problems.
Hopefully the simpler and more direct indexing in the code above will solve any of these problems. But please report back (and provide code to generate df) if you are still seeing that warning.
Use this instead:
def list2columns(df):
    """
    to convert list in the columns
    of a dataframe
    """
    df = df.copy()  # work on an explicit copy so writes never hit a view of another frame
    columns = ['categories', 'neighborhoods']
    for col in columns:
        for i in range(len(df)):
            for element in eval(df.loc[i, col]):
                if len(element) != 0:
                    if element not in df.columns:
                        df.loc[:, element] = 0
                    df.loc[i, element] = 1
    return df
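As an aside, if each cell really holds a string like "['a','b']", a vectorized sketch using str.get_dummies avoids the Python-level loops entirely (the exact string cleanup below is an assumption about your format):
import pandas as pd

df = pd.DataFrame({
    'city': ['Dravosburg', 'Dravosburg'],
    'state': ['PA', 'PA'],
    'neighborhoods': ["['asas','dfd']", "['adad']"],
    'categories': ["['Nightlife']", "['Auto_Repair','Automotive']"],
})
for col in ['neighborhoods', 'categories']:
    # strip the list punctuation, then one-hot encode on commas
    dummies = (df[col].str.strip('[]')
                      .str.replace("'", '', regex=False)
                      .str.get_dummies(sep=','))
    df = df.drop(columns=col).join(dummies)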