How to select rows based on categories in a Pandas dataframe - python

This is really trivial, but I can't believe I have wandered around for an hour and still can't find the answer, so here you are:
df = pd.DataFrame({"cats":["a","b"], "vals":[1,2]})
df.cats = df.cats.astype("category")
df
My problem is how to select the rows whose "cats" column has the category "a". I know that df.loc[df.cats == "a"] will work, but it's based on element-wise equality. Is there a way to select based on the levels of the category?

This works:
df.cats[df.cats=='a']
UPDATE
The question was updated. New solution:
df[df.cats.cat.categories == ['a']]

For those who are trying to filter rows based on a numerical categorical column:
df[df['col'] == pd.Interval(46, 53, closed='right')]
This would keep the rows where the col column has category (46, 53].
This kind of categorical column is common when you discretize numerical columns using pd.qcut() method.
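For illustration, here is a minimal, self-contained sketch (the column name and bin count are invented) of how pd.qcut produces such interval categories and how a single bin is then selected:
import numpy as np
import pandas as pd

# hypothetical data: 100 values binned into 4 quantile-based interval categories
df = pd.DataFrame({'col': np.random.randint(0, 100, size=100)})
df['col_binned'] = pd.qcut(df['col'], q=4)

# inspect the interval categories that qcut created
print(df['col_binned'].cat.categories)

# keep only the rows that fall into the first bin
first_bin = df['col_binned'].cat.categories[0]
subset = df[df['col_binned'] == first_bin]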

You can query the categorical list using df.cats.cat.categories which prints output as
Index(['a', 'b'], dtype='object')
For this case, to select rows with the category 'a', which is df.cats.cat.categories[0], you just use:
df[df.cats == df.cats.cat.categories[0]]

Using the isin function to create a boolean index is an approach that will extend to multiple categories, similar to R's %in% operator.
# will return desired subset
df[df.cats.isin(['a'])]
# can be extended to multiple categories
df[df.cats.isin(['a', 'b'])]
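As a quick end-to-end illustration, reusing the toy frame from the question:
import pandas as pd

df = pd.DataFrame({'cats': ['a', 'b'], 'vals': [1, 2]})
df['cats'] = df['cats'].astype('category')

# isin builds a boolean mask, which then indexes the frame
print(df[df['cats'].isin(['a'])])        # keeps only the row with category 'a'
print(df[df['cats'].isin(['a', 'b'])])   # keeps both rows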

df[df.cats.cat.categories == df.cats.cat.categories[0]]

Related

Exclude values in DF column

I have a problem: I want to drop from my DF all rows whose value in a column ends with "99".
I tried to create a list :
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
This list contains all the concerned values, but how do I apply it to my DF to drop those rows?
I tried a few things but nothing works.
Lately I tried this:
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]
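A small self-contained sketch of this (the column name and values are invented for illustration):
import pandas as pd

df = pd.DataFrame({'XX': ['AB01', 'CD99', 'EF23', 'GH99']})

# str.endswith returns a boolean Series; ~ negates it
mask = df['XX'].str.endswith('99')
df = df[~mask]
print(df)  # only 'AB01' and 'EF23' remain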

Creating new dataframe by appending rows from an old dataframe

I'm trying to create a dataframe by selecting rows that meet only specific conditions from a different dataframe.
Technicians can only select one of several fields for Column 1 using a dropdown menu, so I want to match one specific field. However, Column 2 is a free-text entry, so I'm looking for two specific keywords with any kind of spelling/case.
I want all columns from the rows in the new dataframe.
Any help or insight would be much appreciated.
import pandas as pd
df = pd.read_excel(r'File.xlsx', sheet_name='Sheet1')
filter = ['x', 'y']
columns=df.columns
data = pd.DataFrame(columns=columns)
for row in df.iterrows():
    if 'Column 1' == 'a':
        row.data.append()
    elif df['Column 2'].str.contains('filter', case = 'false'):
        row.data.append()
print(data.head())
In general, it's best to have a vectorized solution, so I'll put mine as follows (there are many ways to do this; this is one of the ways that came to my head). Here, you can use a simple boolean mask to filter out the rows you don't want, since you've already clearly defined your criteria (df['Column 1'] == 'a' or df['Column 2'] containing one of the keywords, case-insensitively).
As such, you can simply build a boolean mask from these criteria. By itself, df['Column 1'] == 'a' will create a boolean Series with the structure [True, False, True, True, ...], where each entry says whether the condition holds for the corresponding row of the original frame. Once you have that, you can simply index back into the original frame with df[df['Column 1'] == 'a'] to return your filtered rows.
Of course, since you have two conditions here (which seem to follow an "or" clause), you can simply combine them in the boolean mask, wrapping each condition in parentheses and joining them with |, such as df[(df['Column 1'] == 'a') | (df['Column 2'].str.contains('|'.join(filter), case=False))].
I'm not at my development computer, so this might not work as expected due to a couple of minor issues, but this is the general idea. This line should replace your entire df.iterrows block. Hope this helps :)
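Here is a minimal sketch of that idea, assuming a toy frame and that the two keywords are 'x' and 'y' as in the question (the data and column contents are invented):
import pandas as pd

# toy stand-in for the Excel sheet
df = pd.DataFrame({
    'Column 1': ['a', 'b', 'a', 'c'],
    'Column 2': ['contains X here', 'nothing', 'also y present', 'none'],
})

keywords = ['x', 'y']
# build one regex so str.contains matches either keyword, case-insensitively
pattern = '|'.join(keywords)
mask = (df['Column 1'] == 'a') | (df['Column 2'].str.contains(pattern, case=False))

data = df[mask]  # new dataframe with all columns of the matching rows
print(data)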

Python.pandas: how to select rows where objects start with letters 'PL'

I have specific problem with pandas: I need to select rows in dataframe which start with specific letters.
Details: I've imported my data into a dataframe and selected the columns that I need. I've also narrowed it down to the row index I need. Now I also need to select rows in another column where the values START with the letters 'pl'.
Is there any solution to select row only based on first two characters in it?
I was thinking about
pl = df['Code'] == pl*
but it won't work due to row indexing. Advise appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
Fully reproducible example for those who want to try it.
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series that should return you a true/false result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df['Code'].str.startswith('pl')].copy()
The condition is just a filter that you then apply to the dataframe. As the filter you may use the method Series.str.startswith and do:
df_pl = df[df['Code'].str.startswith('pl')]

Assigning values to cross selection of MultiIndex DataFrame (ellipsis style of numpy)

In numpy we can select the last axis with ellipsis indexing, f.i. array[..., 4].
In Pandas DataFrames for structuring large amounts of data, I like to use MultiIndex (which I see as some kind of additional dimensions of the DataFrame). If I want to select a given subset of a DataFrame df, in this case all columns 'key' in the last level of the columns MultiIndex, I can do it with the cross selection method xs:
import numpy as np
import pandas as pd

# create sample multiindex dataframe
mi = pd.MultiIndex.from_product((('a', 'b', 'c'), (1, 2), ('some', 'key', 'foo')))
data = pd.DataFrame(data=np.random.rand(20, 18), columns=mi)
# make cross selection:
xs_df = data.xs('key', axis=1, level=-1)
But if I want to assign values to the cross selection, xs won't work.
The documentation proposes to use IndexSlice to access and set values to a cross selection:
idx = pd.IndexSlice
data.loc[:, idx[:, :, 'key']] *= 10
This works well as long as I explicitly account for the number of levels by inserting the correct number of : before 'key'.
Assuming I just want to pass the number of levels to a selection function, or e.g. always select the last level independent of the number of levels of the DataFrame, this won't work (afaik).
My current workaround is using None slices for n_levels to skip:
n_levels = data.columns.nlevels - 1 # assuming I want to select the last level
data.loc[:, (*n_levels*[slice(None)], 'key')] *= 100
This is imho a quite nasty and cumbersome workaround. Is there any more pythonic/nicer/better way?
In this case, you may be better off with get_level_values:
s = data.columns.get_level_values(-1) == 'key'
data.loc[:,s] *= 10
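If the goal is explicitly "always the last level, whatever the number of levels", this can be wrapped in a small helper; a sketch reusing the sample data from the question (the helper name is made up):
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product((('a', 'b', 'c'), (1, 2), ('some', 'key', 'foo')))
data = pd.DataFrame(np.random.rand(20, 18), columns=mi)

def last_level_mask(df, value):
    # boolean mask over the columns, based only on the last column level
    return df.columns.get_level_values(-1) == value

data.loc[:, last_level_mask(data, 'key')] *= 10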
I feel like we can use update and pass drop_level=False to xs:
data.update(data.xs('key',level=-1,axis=1,drop_level=False)*10)
I don't think there is a straightforward way to index and set values the way you want. Adding to the previous answers, I'd suggest naming your columns, which makes them easier to wrangle with the query method:
# assign level names
data.columns = data.columns.set_names(['first', 'second', 'third'])
# select the columns of interest via the named level
ind = data.T.query('third == "key"').index
# assign values
data.loc(axis=1)[ind] *= 10

Python dataframe groupby by dictionary list then sum

I have two dataframes. The first named mergedcsv is of the format:
mergedcsv dataframe
The second dataframe, named idgrp_df, is in a dictionary-like format which, for each region ID, holds a list of corresponding string IDs.
idgrp_df dataframe - keys with lists
For each row in mergedcsv (and the corresponding row in idgrp_df) I wish to select the columns within mergedcsv whose labels match the list in idgrp_df for that row, then sum those values and add the result to a column within mergedcsv. The function will iterate through all rows in mergedcsv (582 rows x 600 columns).
My line of code to try to attempt this is:
mergedcsv['TotRegFlows'] = mergedcsv.groupby([idgrp_df],as_index=False).numbers.apply(lambda x: x.iat[0].sum())
It returns a ValueError: Grouper for class pandas.core.frame.DataFrame not 1-dimensional.
This relates to the input dataframe for the groupby. How can I access the list for each row as the input for the groupby?
So for example, for the first row in mergedcsv I wish to select the columns with labels F95RR04, F95RR06 and F95RR15 (reading from the list in the first row of idgrp_df). Sum the values in these columns for that row and insert the sum value into TotRegFlows column.
Any ideas as to how I can utilize the list would be very much appreciated.
Edits:
Many thanks IanS. Your solution is useful. Following modification of the code based on this advice, I realised that (as suggested) the indexes of my two dataframes are out of sync. I checked the indices (mergedcsv had None and idgrp_df has the 'REG_ID' column as its index). I set the index of mergedcsv to 'REG_ID' as well. Then I realised that mergedcsv has 582 rows (REG_ID is not unique) and idgrp_df has 220 rows (REG_ID is unique). I therefore think I am missing a groupby on the REG_ID index in mergedcsv.
I have modified the code as follows:
mergedcsv.set_index('REG_ID', inplace=True)
print mergedcsv.index.name
print idgrp_df.index.name
mergedcsvgroup = mergedcsv.groupby('REG_ID')[mergedcsv.columns].apply(lambda y: y.tolist())
mergedcsvgroup['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum(), axis=1)
I get a KeyError: 'REG_ID'.
Any further recommendations are most welcome. Would it be more efficient to combine the groupby and apply into one line?
I am new to working with pandas and am trying to build experience in Python.
Further amendments:
Without an index for mergedcsv:
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID').sum(), axis=1)
this throws a KeyError: (the label[0] is not in the [index], u 'occurred at index 0')
With an index for mergedcsv:
mergedcsv.set_index('REG_ID', inplace=True)
columnlist = list(mergedcsv.columns.values)
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID')[columnlist].transform().sum(), axis=1)
this throws a TypeError: ("unhashable type:'list'", u'occurred at index 7')
Or finally separating the groupby function:
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID')
mergedcsv['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum())
this throws a TypeError: unhashable type: 'list'. The axis=1 argument is also not available with groupby apply.
Any ideas how I can use the lists with the apply function? I've explored tuples in the apply code but have not had any success.
Any suggestions much appreciated.
If I understand correctly, I have a simple solution with apply:
Setup
import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6], 'C': [7,8,9]})
lists = pd.Series([['A', 'B'], ['A', 'C'], ['C']])
Solution
I apply a lambda function that gets the list of columns to be summed from the lists series:
df.apply(lambda row: row[lists[row.name]].sum(), axis=1)
The trick is that, when iterating over rows (axis=1), row.name is the original index of the dataframe df. I use that to access the list from the lists series.
Notes
This solution assumes that both dataframes share the same index, which appears not to be the case in the screenshots you included. You have to address that.
Also, if idgrp_df is a dataframe and not a series, then you need to access its values with .loc.
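To address that index mismatch, one possible sketch (the miniature frames below are invented; it assumes idgrp_df is indexed by REG_ID and holds a list of column labels per region, as in the screenshots) is to index mergedcsv by REG_ID as well, so that row.name lines up with the lookup:
import pandas as pd

# hypothetical miniature versions of the two frames
mergedcsv = pd.DataFrame({
    'REG_ID': ['r1', 'r1', 'r2'],
    'F95RR04': [1, 2, 3],
    'F95RR06': [4, 5, 6],
    'F95RR15': [7, 8, 9],
})
idgrp_df = pd.Series({'r1': ['F95RR04', 'F95RR15'], 'r2': ['F95RR06']})

mergedcsv = mergedcsv.set_index('REG_ID')
# row.name is now the REG_ID, so it can be used to look up the column list
mergedcsv['TotRegFlows'] = mergedcsv.apply(
    lambda row: row[idgrp_df.loc[row.name]].sum(), axis=1)
print(mergedcsv)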
