Group a dataframe by the max column value - Python

I have this dataframe and I need to keep only the rows with the max value of the 'revisão' column for each value of the 'mesano' column:
groupede = dfgc.groupby(['mesano','description','paymentCategories.description','paymentCategories.type'])
result = groupede['revisao','paymentCategories.interval.totalPrice'].agg('max','sum')
and I also tried
grouped=dfgc.groupby(['mesano','description','paymentCategories.description','paymentCategories.type','paymentCategories.interval.totalPrice'], as_index=False)['revisao'].max()
but this code is wrong

You can sort the dataframe by 'revisão' in descending order and then drop duplicate 'mesano' rows, keeping only the first entry, which effectively keeps the row with the max value per group:
df.sort_values(by=['revisão'], ascending=False).drop_duplicates(subset=['mesano'], keep='first')
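As a minimal sketch (the original dfgc isn't shown, so the data below is made up and only the relevant columns appear):
import pandas as pd

# Hypothetical data standing in for dfgc.
df = pd.DataFrame({
    'mesano': ['2023-01', '2023-01', '2023-02', '2023-02', '2023-02'],
    'revisão': [1, 3, 2, 5, 4],
    'paymentCategories.interval.totalPrice': [100, 110, 200, 230, 220],
})

# Sort so the highest 'revisão' comes first, then keep the first row per 'mesano'.
result = (df.sort_values(by=['revisão'], ascending=False)
            .drop_duplicates(subset=['mesano'], keep='first'))
print(result)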

We can choose only the rows for which the value of a particular column is equal to the maximum value of that column. This can be done with boolean index filtering, where True in the boolean index means keeping the row and False means dropping it. For your particular use case, you can use
df_max_revisão = df[df['revisão'] == df['revisão'].max()]
where df['revisão'] == df['revisão'].max() generates a boolean index, and df[boolean_index] gives you the rows that are True in the boolean index.
If you want only the values in the 'mesano' column, you can filter the dataset and choose those by using
df_mesano = df['mesano']
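Note that df['revisão'].max() above is the global maximum over the whole frame; if the goal is the maximum 'revisão' within each 'mesano' group (as the question asks), a common pattern is to compare against a per-group maximum via groupby().transform('max'). A minimal sketch with made-up data:
import pandas as pd

# Hypothetical data with two 'mesano' groups.
df = pd.DataFrame({
    'mesano': ['2023-01', '2023-01', '2023-02', '2023-02'],
    'revisão': [1, 3, 2, 5],
})

# Keep rows whose 'revisão' equals the max 'revisão' within their 'mesano' group.
mask = df['revisão'] == df.groupby('mesano')['revisão'].transform('max')
df_max_per_mesano = df[mask]
print(df_max_per_mesano)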

Related

How do I get the range in one dataframe column based on duplicate items in two other columns?

I have a dataframe that contains three columns: 'sequences', 'smiles' and 'labels'. Some of the rows have the same string entries in the 'sequences' and 'smiles' columns, but a different float value in the 'labels' column. For duplicate sequences and smiles, I would like to get the range of values of the 'labels' column for those duplicate rows, which will be stored in a fourth column. I intend to reject rows that have a range above a certain value.
I have made a dataframe that contains all the duplicate values:
duplicate_df = pd.concat(g for _, g in df.groupby(['sequence', 'smiles']) if len(g) > 1)
How do I get the range of the labels from the df?
Is there something like this I can do?
duplicate_df.groupby(['Target_sequence', 'processed_SMILES']).range()
My duplicate_df looks like this:
pd.DataFrame({'Label': {86468: 55700.0,
86484: 55700.0,
86508: 55700.0,
124549: 55690.0,
124588: 55690.0},
'Target_sequence': {86468: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
86484: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
86508: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
124549: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
124588: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF'},
'processed_SMILES': {86468: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1',
86484: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3ccccc3F)cs2)CC1',
86508: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
124549: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
124588: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1'}})
For example, for duplicate rows whose 'Label' values are all the same, I would like to have 0 in the 'range' column.
std() is a valid aggregation function for a group-by object. Therefore, after creating your df with the duplicated data, you can try:
duplicate_df.groupby(['Target_sequence', 'processed_SMILES'])['Label'].std()
Edit:
This is a nice opportunity to use pd.NamedAgg which was released in version 0.25:
df.groupby(['Target_sequence', 'processed_SMILES']).agg(
    Minimum=pd.NamedAgg(column='Label', aggfunc='min'),
    Maximum=pd.NamedAgg(column='Label', aggfunc='max'))
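Since the question asks for the range (max minus min) per duplicated group rather than the standard deviation, here is a small sketch of how that could be computed and used as a filter; the data below is a made-up stand-in for duplicate_df and the threshold is hypothetical:
import pandas as pd

# Toy stand-in for duplicate_df: two duplicated (Target_sequence, processed_SMILES)
# groups whose 'Label' values differ within the first group only.
duplicate_df = pd.DataFrame({
    'Label': [55700.0, 55710.0, 55690.0, 55690.0],
    'Target_sequence': ['AAPY', 'AAPY', 'AAPY', 'AAPY'],
    'processed_SMILES': ['CCO', 'CCO', 'CN1', 'CN1'],
})

# Range (max - min) of 'Label' per group, broadcast back onto every row;
# identical labels within a group give 0.
duplicate_df['range'] = (duplicate_df
    .groupby(['Target_sequence', 'processed_SMILES'])['Label']
    .transform(lambda s: s.max() - s.min()))

threshold = 10.0  # hypothetical cut-off
filtered_df = duplicate_df[duplicate_df['range'] <= threshold]
print(filtered_df)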

Pandas groupby shift and count at the same time

Basically I am trying to take the previous row for the combination of ['dealer','State','city']. If there are multiple values for this combination, I will get the shifted value of this combination.
df['ShiftBY_D_S_C']= df.groupby(['dealer','State','city'])['dealer'].shift(1)
I am taking this ShiftBY_D_S_C column again and trying to take the count for the ['ShiftBY_D_S_C','State','city'] combination.
df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count"))+1
This works well in general. But when all the rows in the ShiftBY_D_S_C column are null, it does not work, since the count is taken over all-null values. Any suggestions?
I would like NewColumn to be 1 for every row when all the values in ShiftBY_D_S_C are NaN.
You could simply handle the special case that you describe with an if/else:
if df['ShiftBY_D_S_C'].isna().all():
    df['NewColumn'] = 1
else:
    df['NewColumn'] = df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count") + 1
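A small end-to-end sketch with made-up data, assuming (as described in the question) that NewColumn should simply be 1 when every ShiftBY_D_S_C entry is NaN:
import pandas as pd

# Hypothetical data: one row per (dealer, State, city) combination,
# so every shifted value comes out as NaN.
df = pd.DataFrame({
    'dealer': ['A', 'B', 'C'],
    'State':  ['NY', 'NY', 'CA'],
    'city':   ['NYC', 'Albany', 'LA'],
})

df['ShiftBY_D_S_C'] = df.groupby(['dealer', 'State', 'city'])['dealer'].shift(1)

if df['ShiftBY_D_S_C'].isna().all():
    # No previous rows exist for any combination, so the count-based column is just 1.
    df['NewColumn'] = 1
else:
    df['NewColumn'] = df.groupby(['ShiftBY_D_S_C', 'State', 'city'])['ShiftBY_D_S_C'].transform('count') + 1

print(df)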

Use of a slice and a boolean index in the same iloc statement

"Python for data analysis" (ch5) uses a double selection:
data.iloc[:,:3][data.three>5]
There is no explanation of the logic behind this statement. How should it be understood?
Is it a selection over a previous selection, i.e. data.iloc[:,:3] first selects all rows and the first three columns, then [data.three>5] reduces this selection to all rows for which the value in column 'three' is greater than 5?
I saw also the following expression:
df[['CoCode','Doc_Type','Doc_Nr','Amount_LC']][df['Amount_LC']>1000000000]
I am a bit lost. It looks like loc and iloc can be used with a double selection, i.e. df.loc[][]. What is the logic of the second []? What goes in the first one, and what in the second?
Two separate selections are being applied here to dataframe data:
1) data.iloc[:,:3] is selecting all rows, and all columns up to (but not including) column index 3, thus column indices 0, 1 and 2
2) The dataframe data is being limited to all rows where column three contains values greater than 5
The output of these two selections is independent of ordering, therefore:
data.iloc[:,:3][data.three>5] == data[data.three>5].iloc[:,:3] will return a dataframe populated with True
Note that you are not using double selection here (as you call it), but rather you are querying specific rows and columns in your first selection, while your second selection is merely a filter applied to the dataframe returned by your first selection.
Effectively, you are using .iloc() to select specific index locations (or slices) from the dataframe, while .loc() allows to select specific locations based on column and row labels.
Finally, when filtering your dataframe with something like data[data.three>5], you can read this as "Return rows in dataframe data where the column three of that row has a value greater than 5".
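A minimal sketch with made-up data (the book's data dataframe isn't reproduced here) showing that the two chained selections return the same result:
import pandas as pd

# Hypothetical stand-in for the book's `data` dataframe.
data = pd.DataFrame({
    'one':   [0, 4, 8, 12],
    'two':   [1, 5, 9, 13],
    'three': [2, 6, 10, 14],
    'four':  [3, 7, 11, 15],
})

a = data.iloc[:, :3][data.three > 5]   # columns 0-2 first, then rows where three > 5
b = data[data.three > 5].iloc[:, :3]   # rows where three > 5 first, then columns 0-2
print(a.equals(b))  # True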
iloc and loc take two parameters: a row selection and a column selection.
data.iloc[<row selection>, <column selection>]
Hope this helped.
Is it a selection over a previous selection, i.e. data.iloc[:,:3] first selects all lines and first three columns, then [data.three>5] reduces this selection to all lines for which the values in column 'three' is greater than 5 ?
Yes, #rahlf23 has a great explanation.
It looks like loc and iloc can be used with double selection, i.e df.loc[][] what is the logic of the second []? What goes in the first one, and in the second ?
You can even chain three or more row selections.
Example:
df = pd.DataFrame({'a': [1,2,3,4,5], 'b': [6,7,8,9,10], 'c': [11,12,13,14,15]})
# It will give you the first 3 rows of columns a and b
df.iloc[:,:2][:4][:3]
# It will give you {'a': [2,3], 'b': [7,8]}
df.iloc[:,:2][df.a*7 > df.c][:2]
# It will raise an error; you can't slice further on columns this way
df.iloc[:,:2][:3,:1]

Python Pandas - How to filter multiple columns by one value

I am doing analysis with pandas.
My table has 7M rows * 30 columns. Cell values range from -1 to 3. Now I want to filter out rows based on column values. I understand how to select based on multiple conditions: write down the conditions and combine them with "&" and "|". But I have 30 columns to filter, all by the same value. For instance, rows where any of the last 12 columns equal -1 need to be selected:
df.iloc[:,-12:]==-1
The code above gives me a boolean dataframe; I need the actual dataframe. The logic here is "or": if any column has the value -1, that row needs to be selected.
Also, it would be good to know what to do if I need "and", i.e. all columns have the value -1.
Many thanks
For the OR case, use DF.any (returns True if any element is True along a particular axis):
df[(df.iloc[:,-12:] == -1).any(axis=1)]
For the AND case, use DF.all (returns True if all elements are True along a particular axis):
df[(df.iloc[:,-12:] == -1).all(axis=1)]
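A quick sketch with a small made-up frame (three columns standing in for the last twelve) showing the difference between the two:
import pandas as pd

# Hypothetical data; imagine these are the last columns of the real 7M x 30 table.
df = pd.DataFrame({
    'x': [-1,  0,  2],
    'y': [ 3, -1, -1],
    'z': [ 1,  2, -1],
})

any_match = df[(df.iloc[:, -3:] == -1).any(axis=1)]  # every row here has at least one -1
all_match = df[(df.iloc[:, -3:] == -1).all(axis=1)]  # empty: no row is -1 in all three columns
print(any_match)
print(all_match)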

Pandas df sum rows based on index column

I have a Pandas df (see below) and I want to sum values based on the index column. My index column contains string values. In the example below, I am trying to combine Moving, Playing and Using Phone together as "Active Time" and sum their corresponding values, while keeping the other index values as they already are. Any suggestions on how I can handle this type of scenario?
Activity      AverageTime
Moving        0.000804367
Playing       0.001191772
Stationary    0.320701558
Using Phone   0.594305473
Unknown       0.060697612
Idle          0.022299218
I am sure that there must be a simpler way of doing this, but here is one possible solution.
# Filters for active and inactive rows
active_row_names = ['Moving', 'Playing', 'Using Phone']
active_filter = [row in active_row_names for row in df.index]
inactive_filter = [not row for row in active_filter]

active = df.loc[active_filter].sum()        # Sum of 'active' rows as a Series
active = pd.DataFrame(active).transpose()   # As a dataframe, and fix orientation
active.index = ["active"]                   # Assign new index name

# Keep the inactive rows as they are, and replace the active rows with the
# newly defined row that is the sum of the previous active rows.
# (pd.concat is used here because DataFrame.append was removed in newer pandas.)
df = pd.concat([df.loc[inactive_filter], active])
OUTPUT
Activity AverageTime
Stationary 0.320702
Unknown 0.060698
Idle 0.022299
active 0.596302
This will work even when only a subset of the active row names is present in the dataframe.
I would add a new boolean column called "active" and then group by that column:
df['active'] = False
df.loc[['Moving', 'Playing', 'Using Phone'], 'active'] = True
df.groupby('active').AverageTime.sum()
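For reference, a compact variant of the first answer (not from the original posts, just a common pattern) that builds the row mask with index.isin; the data is reconstructed from the question:
import pandas as pd

# Reconstruction of the question's data.
df = pd.DataFrame(
    {'AverageTime': [0.000804367, 0.001191772, 0.320701558,
                     0.594305473, 0.060697612, 0.022299218]},
    index=['Moving', 'Playing', 'Stationary', 'Using Phone', 'Unknown', 'Idle'],
)
df.index.name = 'Activity'

active_names = ['Moving', 'Playing', 'Using Phone']
is_active = df.index.isin(active_names)

# Collapse the active rows into a single 'Active Time' row; keep the rest as-is.
active_row = df.loc[is_active].sum().to_frame(name='Active Time').T
result = pd.concat([active_row, df.loc[~is_active]])
print(result)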
