Pandas df sum rows based on index column - python

I have a Pandas df (See below), I want to sum the values based on the index column. My index column contains string values. See the example below, here I am trying to add Moving, Playing and Using Phone together as "Active Time" and sum their corresponding values, while keep the other index values as these are already are. Any suggestions, that how can I work with this type of scenario?
**Activity AverageTime**
Moving 0.000804367
Playing 0.001191772
Stationary 0.320701558
Using Phone 0.594305473
Unknown 0.060697612
Idle 0.022299218

I am sure that there must be a simpler way of doing this, but here is one possible solution.
# Filters for active and inactive rows
active_row_names = ['Moving','Playing','Using Phone']
active_filter = [row in active_row_names for row in df.index]
inactive_filter = [not row for row in active_filter]
active = df.loc[active_filter].sum() # Sum of 'active' rows as a Series
active = pd.DataFrame(active).transpose() # as a dataframe, and fix orientation
active.index=["active"] # Assign new index name
# Keep the inactive rows as they are, and replace the active rows with the
# newly defined row that is the sum of the previous active rows.
df = df.loc[inactive_filter].append(active, ignore_index=False)
OUTPUT
Activity AverageTime
Stationary 0.320702
Unknown 0.060698
Idle 0.022299
active 0.596302
This will work even when only a subset of the active row names are present in the dataframe.

I would add a new boolean column called "active" and then groupby that column:
df['active']=False
df['active'][['Moving','Playing','Using Phone']] = True
df.groupby('active').AverageTime.sum()

Related

group by a dataframe by the max column value

I have this dataframe and I need to leave only the lines with the max value of the 'revisão' column referring to each value of the 'mesano' column
groupede=dfgc.groupby(['mesano','description','paymentCategories.description','paymentCategories.type']) result=groupede['revisao','paymentCategories.interval.totalPrice'].agg('max','sum')
and i try too
grouped=dfgc.groupby(['mesano','description','paymentCategories.description','paymentCategories.type','paymentCategories.interval.totalPrice'], as_index=False)['revisao'].max()
but this code is wrong
You can sort the dataframe by highest value in revisao and then drop all duplicate rows and only keep the first entry, essentially filtering by max value:
df.sort_values(by=['revisão'], ascending=False).drop_duplicates(keep='first')
We can choose only the rows for which the value of a particular column is equal to the maximum value for that column. This can be done by using Boolean index filtering, where a 1 in the boolean index specifies keeping this row, and 0 means dropping it. For your particular use case, you can use
df_max_revisão = df[df['revisão'] == df['revisão'].max()]
where df['revisão'] == df['revisão'].max() generates a boolean index, and df[boolean_index] gives you the rows with 1 in the boolean index.
If you want only the values in the 'mesano' column, you can filter the dataset and choose those by using
df_mesano = df['mesano']

How do I search a pandas dataframe to get the row with a cell matching a specified value?

I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion and a lot of other columns. I want to find all the rows in df_everything with a name value matching the name values from df_selection_names so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and searching for values within a row including
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
# get the cell value using at() function
# in 'name' column and i-1 row
sel = df_selection_names.at[i, 'name']
# this line finds the row 'sel' in df_everything
row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be
way off, but I'm feeling pretty confused about everything at this point!
https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin
df_everything[df_everything['name'].isin(df_selection_names["name"])]

How do I get the range in one dataframe column based on duplicate items in two other columns?

I have a dataframe that contains three columns: 'sequences', 'smiles' and 'labels'. Some of the rows have the same string entries in the 'sequences' and 'smiles' column, but a different float value in the 'labels' column. For duplicate sequences and smiles, I would like the get the range of values of the 'labels' column for those duplicate rows, which will be stored in a fourth column. I intend to reject rows, which have a range above a certain value.
I have made a dataframe that contains all the duplicate values:
duplicate_df = pd.concat(g for _, g in df.groupby(['sequence', 'smiles']) if len(g) > 1)
How do I get the range of the labels from the df?
Is there something like this I can do?
duplicate_df.groupby(['Target_sequence', 'processed_SMILES']).range()
My duplicate_df looks like this:
pd.DataFrame({'Label': {86468: 55700.0,
86484: 55700.0,
86508: 55700.0,
124549: 55690.0,
124588: 55690.0},
'Target_sequence': {86468: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
86484: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
86508: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
124549: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
124588: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF'},
'processed_SMILES': {86468: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1',
86484: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3ccccc3F)cs2)CC1',
86508: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
124549: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
124588: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1'}})
For example, duplicate rows where the items are the same I would like to have 0 in the 'range' column.
std() is a valid aggregation function for group-by object. Therefore, after creating your df with the duplicated data, you can try:
duplicate_df.groupby(['Target_sequence', 'processed_SMILES'])['labels'].std()
Edit:
This is a nice opportunity to use pd.NamedAgg which was released in version 0.25:
df.groupby(['Target_sequence','processed_SMILES']).agg(Minimum = pd.NamedAgg(column='Label',aggfunc='min'),
Maximum = pd.NamedAgg(column='Label',aggfunc='max'))

For Python Pandas, how to implement a "running" check of 2 rows against the previous 2 rows?

[updated with expected outcome]
I'm trying to implement a "running" check where I need the sum and mean of two rows to be more than the previous 2 rows.
Referring to the dataframe (copied into spreadsheet) below, I'm trying code out a function where if the mean of those two orange cells is more than the blue cells, the function will return true for row 8, under a new column called 'Cond11'. The dataframe here is historical, so all rows are available.
Note that that Rows column is added in the spreadsheet, easier for me to reference the rows here.
I have been using .rolling to refer to the current row + whatever number of rows to refer to, or using shift(1) to refer to the previous row.
df.loc[:, ('Cond9')] = df.n.rolling(4).mean() >= 30
df.loc[:, ('Cond10')] = df.a > df.a.shift(1)
I'm stuck here... how to I do this 2 rows vs the previous 2 rows? Please advise!
The 2nd part of this question: I have another function that checks the latest rows in the dataframe for the same condition above. This function is meant to be used in real-time, when new data is streaming into the dataframe and the function is supposed to check the latest rows only.
Can I check if the following code works to detect the same conditions above?
cond11 = candles.n[-2:-1].sum() > candles.n[-4:-3].sum()
I believe this solves your problem:
df.rolling(4).apply(lambda rows: rows[0] + rows[1] < rows[2] + rows[3])
The first 3 rows will be NaNs but you did not define what you would like to happen there.
As for the second part, to be able to produce this condition live for new data you just have to prepend the last 3 rows of your current data and then apply the same process to it:
pd.concat([df[-3:], df])

Python: Matching a pattern and relocating values to another column

I am trying to select a range of numbers from one column 'Description' and then move this pattern to a new column called 'Seating' however the new column is not returning any values and is just populated with values equalling to 'none'. I have used a for loop to iterate through the columns to locate any rows with this pattern but as i said this returns values equal to none. Maybe I have defined the pattern incorrectly.
import re
import pandas as pd
# Defined the indexes
data = pd.read_csv('Inspections.csv').set_index('ACTIVITY DATE')
# Created a new column for seating which will be populated with pattern
data['SEATING'] = None
# Defining indexes for desired columns
index_description = data.columns.get_loc('PE DESCRIPTION')
index_seating = data.columns.get_loc('SEATING')
# Creating a pattern to be extracted
seating_pattern = r' \d([1-1] {1} [999-999] {3}\/[61-61] {2} [150-150] {3})'
# For loop to iterate through rows to find and extract pattern to 'Seating' column
for row in range(0, len(data)):
score = re.search(seating_pattern, data.iat[row, index_description])
data.iat[row, index_seating] = score
data
Output of code showing table where the columns are populated:
Following code populates seating column
I have tried .group() and it returns the following error AttributeError: 'NoneType' object has no attribute 'group'
What am I doing wrong in that it shows <re.Match object; span=(11, 17), match='(0-30)'> instead of the result from the pattern.
It's not completely clear to me what you want to extract with your pattern. But here's a suggestion that might help. With this small sample frame
df = pd.DataFrame({'Col1': ['RESTAURANT (0-30) SEATS MODERATE RISK',
'RESTAURANT (31-60) SEATS HIGH RISK']})
Col1
0 RESTAURANT (0-30) SEATS MODERATE RISK
1 RESTAURANT (31-60) SEATS HIGH RISK
this
df['Col2'] = df['Col1'].str.extract(r'\((\d+-\d+)\)')
gives you
Col1 Col2
0 RESTAURANT (0-30) SEATS MODERATE RISK 0-30
1 RESTAURANT (31-60) SEATS HIGH RISK 31-60
Selecting columns in pandas can be much easier than this
first take a copy of the dataframe to apply the changes safely and then select values as the following
data_copied = data.copy()
data_copied['SEATING'] = data_copied[(data_copied['Description'] <= start_range_value) & (data_copied['Description'] >= end_range_value)]
this link is helpful on building column by selecting based on rows of another column without changing values https://www.geeksforgeeks.org/how-to-select-rows-from-a-dataframe-based-on-column-values/
this question to dive into the same topic with more customization , it will make u solve similar more complex issues
pandas create new column based on values from other columns / apply a function of multiple columns, row-wise

Categories

Resources