Select slice of dataframe according to value of multiIndex - python

I have a multiIndex dataframe.
I am able to create a logical mask using the following:
df.index.get_level_values(0).to_series().str.find('1000')!=-1
This returns a boolean True for all the rows where the first index level contains the characters '1000' and False otherwise.
But I am not able to slice the dataframe using that mask.
I tried with:
df[df.index.get_level_values(0).to_series().str.find('1000')!=-1]
and it returned the following error:
ValueError: cannot reindex from a duplicate axis
I also tried with:
df[df.index.get_level_values(0).to_series().str.find('1000')!=-1,:]
which only prints the boolean mask as output, followed by this error:
Length: 1755, dtype: bool, slice(None, None, None))' is an invalid key
Can someone point me to the right solution and to a good reference on how to properly slice a MultiIndex dataframe?

One idea is to remove to_series() and use Index.str.contains to test for the substring:
df[df.index.get_level_values(0).str.contains('1000')]
Another is to convert the mask to a NumPy array:
df[df.index.get_level_values(0).str.contains('1000').values]
Your own solution also works once the values of the mask are converted to an array:
df[(df.index.get_level_values(0).to_series().str.find('1000')!=-1).values]
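A minimal sketch of both approaches on made-up data (the index labels and the 'value' column are assumptions, not the asker's data). A likely reason the original attempt failed is that to_series() yields a mask whose own index repeats the level values, so pandas tries to align it by label and hits the duplicates; a plain array is applied by position instead.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a two-level index whose first level holds strings,
# with repeated labels in level 0 as in the question.
index = pd.MultiIndex.from_tuples(
    [("a1000", 1), ("a1000", 2), ("b2000", 1), ("c1000x", 1)],
    names=["code", "n"],
)
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=index)

# Mask built directly from the first index level, without to_series().
mask = np.asarray(df.index.get_level_values(0).str.contains("1000"), dtype=bool)
selected = df[mask]

# The asker's str.find() mask, converted to an array so it is applied by position.
find_mask = (df.index.get_level_values(0).to_series().str.find("1000") != -1).to_numpy()
selected_via_find = df[find_mask]
```

Both selections keep the rows whose first-level label contains '1000'.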

Related

When creating a boolean series from subsetting a df, what does index-subsetting in the same line filter for?

I'm experimenting with pandas and noticed something that seemed odd.
If you have a boolean series defined as an object, you can then subset that object by index numbers, e.g.,
Starting from the DataFrame 'ah', I create the boolean series 'tah' via
tah = ah['_merge']=='left_only'
This boolean series could be index-subset like this:
tah[0:1]
yielding:
Yet if I tried to do this all in one line
ah['_merge']=='left_only'[0:1]
I get an unexpected output, where the boolean series is neither sliced nor seems to correspond to the subsetted-column:
I've been experimenting and can't seem to determine what the [0:1] is slicing or filtering for in the all-in-one-line version. Any clarification would be appreciated!
Because indexing binds more tightly than ==, the [0:1] is applied to the string 'left_only' and yields its first character, 'l'. The column is then compared with 'l', and since 'l' equals neither 'left_only' nor 'both', the result is False in every row, so a whole column of False values is printed.
You can use (ah['_merge']=='left_only')[0:1] as MattDMo mentioned in the comments.
Or you can also use pandas.Series.iloc with a slice object to select the elements you need based on their position in the series.
(ah['_merge']=='left_only').iloc[0:1]
Both commands return a one-element series containing True, since the first row of your dataframe has a 'left_only' type of merge.
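The precedence issue can be checked on a small stand-in for 'ah' (the three rows below are made up for illustration, and plain strings are used in place of the categorical '_merge' column that a real merge produces).

```python
import pandas as pd

# Hypothetical stand-in for the '_merge' column of the asker's DataFrame 'ah'.
ah = pd.DataFrame({"_merge": ["left_only", "both", "left_only"]})

# Indexing binds tighter than ==, so the slice applies to the string literal.
literal_slice = "left_only"[0:1]                    # the one-character string 'l'
unparenthesized = ah["_merge"] == "left_only"[0:1]  # compares every row with 'l'

# Parentheses make the comparison happen first; the slice then applies
# to the resulting boolean Series.
parenthesized = (ah["_merge"] == "left_only")[0:1]
with_iloc = (ah["_merge"] == "left_only").iloc[0:1]
```

Here unparenthesized is False in every row, while with_iloc holds the single value for the first row.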

Understanding bracket filter syntax in pandas

How does the following filter out the results in pandas? For example, with this statement:
df[['name', 'id', 'group']][df.id.notnull()]
I get 426 rows (it filters out everything where df.group IS NOT NULL). However, if I just use that syntax by itself, it returns a bool for each row, {index: bool}:
[df.group.notnull()]
How does the bracket notation work with pandas? Another example would be:
df.id[df.id==458514] # filters out rows
# vs
[df.id==458514] # returns a bool
Not a full answer, just a breakdown of df.id[df.id==458514]
df.id returns a series with the contents of column id
df.id[...] slices that series with either 1) a boolean mask, 2) a single index label or a list of them, 3) a slice of labels in the form start:end:step. If it receives a boolean mask then it must have the same length as the series being sliced. If it receives index label(s) then it will return those specific rows. Slicing works just as with Python lists, but start and end can be integer locations or index labels (e.g. ['a':'e'] will return all rows in between, including 'e').
df.id[df.id==458514] returns a filtered series with your boolean mask, i.e. only the items where df.id equals 458514. It also works with other boolean masks as in df.id[df.name == 'Carl'] or df.id[df.name.isin(['Tom', 'Jerry'])].
Read more in the pandas intro to data structures.
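To make the difference concrete, here is a small sketch on made-up data (the rows are assumptions; only the column names come from the question). Square brackets on their own build an ordinary Python list, whereas brackets placed after a DataFrame or Series are an indexing operation that receives the mask.

```python
import pandas as pd

# Hypothetical data reusing the column names from the question.
df = pd.DataFrame(
    {
        "name": ["Tom", "Jerry", "Carl"],
        "id": [458514, None, 7],
        "group": ["a", "b", None],
    }
)

mask = df.id.notnull()   # boolean Series, one entry per row
wrapped = [mask]         # a plain Python list holding that Series, nothing is filtered

# Brackets after an object index it: the mask selects the rows where it is True.
rows = df[["name", "id", "group"]][mask]
ids = df.id[df.id == 458514]
```

So [df.id==458514] by itself never touches the data; it only wraps the boolean series in a list.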

What is the difference between where, mask and df[S>0] in pandas?

Say I want to slice the dataframe with the condition (element > 0):
How does where, mask, df[S>0] behave?
Thanks
where
Takes a boolean array or pandas object as the condition. It keeps the values where the condition is True and replaces the rest with np.nan. Optionally, you can pass an other argument that will be used as the replacement instead of np.nan
mask
The same thing as where except it keeps the False and replaces the True
df[S > 0]
Filters the rows of df if S > 0 is a boolean series. If the key is a boolean DataFrame instead, it works like where.
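A short sketch of the three behaviours on a made-up frame (the values and the choice of column 'a' as S are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [-4, 5, -6]})
S = df["a"]

kept = df.where(df > 0)       # same shape; entries where the condition is False become NaN
hidden = df.mask(df > 0)      # same shape; entries where the condition is True become NaN
filled = df.where(df > 0, 0)  # `other` is used as the replacement instead of NaN
rows = df[S > 0]              # boolean Series key: only matching rows are kept
frame_key = df[df > 0]        # boolean DataFrame key: behaves like df.where(df > 0)
```

Note that where and mask preserve the shape of df, while df[S > 0] drops rows.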

filtering using columns of different types

I'm new to Python and I'm struggling with filtering.
I'm running the following in python:
DataFrame[(DataFrame.column1<2 & DataFrame.column2=='text')]
and the error I get is
cannot compare a dtyped [object] array with a scalar of type [bool]
Column1 is a float64 type and column2 is object.
The filter must be a combination of both of them.
Any ideas?
This is a short example of the syntax you should use
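import pandas as pd
# illustrative data for both columns
df = pd.DataFrame({'column1': [1.0, 3.0], 'column2': ['text', 'text']})
# each condition is wrapped in its own parentheses
df[(df['column1']<2) & (df['column2']=='text')]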
pd.DataFrame.__getitem__, or the equivalent syntax df[], does accept a boolean series, but pd.DataFrame.loc makes the row selection explicit. The error itself comes from operator precedence: & binds more tightly than < and ==, so you should surround each condition with parentheses. For example:
mask = (df['column1'] < 2) & (df['column2'] == 'text')
df = df.loc[mask]
Note also that you shouldn't name your dataframe DataFrame, because this can shadow the pandas class of the same name.
For object dtype, note that pandas has traditionally had no dedicated str dtype by default, so strings are stored in object dtype series. See also How to convert column with dtype as object to string in Pandas Dataframe. You shouldn't need to apply any conversion, and if you do, you can use df['column2'].astype(str) == 'text'.

Pandas selecting with unaligned indexes

I have 2 series.
The first one contains a list of numbers with an index counting 0..8.
A = pd.Series([2,3,4,6,5,4,7,6,5], name=['A'], index=[0,1,2,3,4,5,6,7,8])
The second one only contains True values, but the index of the series is a subset of the first one.
B = pd.Series([1, 1, 1, 1, 1], name=['B'], index=[0,2,4,7,8], dtype=bool)
I'd like to use B as boolean vector to get the A-values for the corresponding indexes, like:
A[B]
[...]
IndexingError: Unalignable boolean Series key provided
Unfortunately this raises an error.
Do I need to align them first?
Does
A[B.index.values]
work for your version of pandas? (I see we have different versions because now the Series name has to be hashable, so your code gave me an error)
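A sketch of two ways around the alignment error, using the series from the question with hashable string names (an adjustment, since list names are rejected by newer pandas). The reindex step is an alternative not mentioned above: it extends B to A's index and treats the missing labels as False.

```python
import pandas as pd

A = pd.Series([2, 3, 4, 6, 5, 4, 7, 6, 5], name="A", index=range(9))
B = pd.Series([True] * 5, name="B", index=[0, 2, 4, 7, 8])

# Option 1: B holds only True values, so its index labels alone identify the rows.
by_label = A.loc[B.index]

# Option 2: align B to A's index first, filling the missing labels with False,
# then use the result as an ordinary boolean mask.
aligned = B.reindex(A.index, fill_value=False)
by_mask = A[aligned]
```

Option 1 relies on B containing no False entries; option 2 also works when it does.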
