Summing values up to a column value change in a pandas DataFrame - python

I have a pandas data frame that looks like this:
            Count  Status
Date
2021-01-01     11       1
2021-01-02     13       1
2021-01-03     14       1
2021-01-04      8       0
2021-01-05      8       0
2021-01-06      5       0
2021-01-07      2       0
2021-01-08      6       1
2021-01-09      8       1
2021-01-10     10       0
I want to calculate the difference between the initial and final value of the "Count" column before the "Status" column changes from 0 to 1 or vice versa (one value per cycle), and make a new DataFrame out of these values.
The output for this example would be:
Cycle  Difference
    1           3
    2          -6
    3           2
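For reference, a minimal sketch that reproduces the sample frame (values taken from the question; assumes the index is a DatetimeIndex named Date):
import pandas as pd

df = pd.DataFrame(
    {'Count': [11, 13, 14, 8, 8, 5, 2, 6, 8, 10],
     'Status': [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]},
    index=pd.date_range('2021-01-01', periods=10, name='Date'))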

Use GroupBy.agg on consecutive groups, created by comparing shifted values and taking a cumulative sum, then subtract the first value from the last:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first', 'last'])
        .eval('last - first')
        .reset_index(name='Difference'))
print(df)
   Cycle  Difference
0      1           3
1      2          -6
2      3           2
3      4           0
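To see how the grouper works: comparing each Status to its shifted neighbour and taking the cumulative sum labels every run of equal values with an increasing integer (intermediate result, computed on the original frame):
print(df['Status'].ne(df['Status'].shift()).cumsum())
Date
2021-01-01    1
2021-01-02    1
2021-01-03    1
2021-01-04    2
2021-01-05    2
2021-01-06    2
2021-01-07    2
2021-01-08    3
2021-01-09    3
2021-01-10    4
Name: Status, dtype: int64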
If you need to filter out groups with only one row, it is possible to add a GroupBy.size aggregation and then filter out rows with DataFrame.loc:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first', 'last', 'size'])
        .loc[lambda x: x['size'] > 1]
        .eval('last - first')
        .reset_index(name='Difference'))
print(df)
   Cycle  Difference
0      1           3
1      2          -6
2      3           2

You can use GroupBy.agg on the groups formed from the consecutive values, then take the last value minus the first (see below for variants):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1] - x.iloc[0])
       )
output:
Status
1     3
2    -6
3     2
4     0
Name: Count, dtype: int64
If you only want to do this for groups of more than one element:
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1] - x.iloc[0] if len(x) > 1 else pd.NA)
         .dropna()
       )
output:
Status
1     3
2    -6
3     2
Name: Count, dtype: object
To get the output as a DataFrame, add .rename_axis('Cycle').reset_index(name='Difference'):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1] - x.iloc[0] if len(x) > 1 else pd.NA)
         .dropna()
         .rename_axis('Cycle').reset_index(name='Difference')
       )
output:
   Cycle  Difference
0      1           3
1      2          -6
2      3           2
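A third variant (a sketch, not from the answers above): compute the cycle labels once, mask out one-row cycles with a size transform, then aggregate:
grp = df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle')
mask = df.groupby(grp)['Count'].transform('size') > 1
out = (df.loc[mask]
         .groupby(grp[mask])['Count']
         .agg(lambda s: s.iloc[-1] - s.iloc[0])
         .reset_index(name='Difference'))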

Related

Mesh / divide / explode values in a column of a DataFrame according to a number of meshes for each value

Given a DataFrame
df1:
   value  mesh
0     10     2
1     12     3
2      5     2
obtain a new DataFrame df2 in which each row of df1 is expanded into mesh rows, each equal to the corresponding value of df1 divided by its mesh:
df2:
   value/mesh
0           5
1           5
2           4
3           4
4           4
5         2.5
6         2.5
More general:
df1:
   value  mesh_value  other_value
0     10           2            0
1     12           3            1
2      5           2            2
obtain:
df2:
   value/mesh_value  other_value
0                 5            0
1                 5            0
2                 4            1
3                 4            1
4                 4            1
5               2.5            2
6               2.5            2
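The answers below work on the general case; a minimal setup sketch (column names taken from the question, frame called df as in the last two answers):
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 5],
                   'mesh_value': [2, 3, 2],
                   'other_value': [0, 1, 2]})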
You can do it with map, building a lookup from each value/mesh result back to the index of the row in df1 that produced it:
df2['new'] = df2['value/mesh'].map(dict(zip(df1.eval('value/mesh'), df1.index)))
Out[243]:
0    0
1    0
2    1
3    1
4    1
5    2
6    2
Name: value/mesh, dtype: int64
Try as follows:
Use Series.div to compute value / mesh_value, then apply Series.reindex with np.repeat(df.index, df.mesh_value) so each result is repeated mesh_value times.
Next, use pd.concat to combine the result with df.other_value along axis=1.
Finally, rename the column holding the result of value / mesh_value (its default name will be 0) using df.rename, and chain df.reset_index to get a standard index.
df2 = (pd.concat([df.value.div(df.mesh_value)
                    .reindex(np.repeat(df.index, df.mesh_value)),
                  df.other_value], axis=1)
         .rename(columns={0: 'value_mesh_value'})
         .reset_index(drop=True))
print(df2)
   value_mesh_value  other_value
0               5.0            0
1               5.0            0
2               4.0            1
3               4.0            1
4               4.0            1
5               2.5            2
6               2.5            2
Or slightly different:
Use df.assign to add a column with the result of df.value.div(df.mesh_value), and reindex / rename in the same way as above.
Use df.drop to get rid of the columns you don't want (value, mesh_value) and df.iloc to change the column order (we want ['value_mesh_value', 'other_value'] rather than the other way around, hence [1, 0]). And again, reset the index.
Wrap all of this in parentheses and assign it to df2.
df2 = (df.assign(tmp=df.value.div(df.mesh_value))
         .reindex(np.repeat(df.index, df.mesh_value))
         .rename(columns={'tmp': 'value_mesh_value'})
         .drop(columns=['value', 'mesh_value'])
         .iloc[:, [1, 0]]
         .reset_index(drop=True))
# same result
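A more compact sketch of the same idea using Index.repeat instead of reindex/np.repeat (same assumptions as above):
df2 = (df.loc[df.index.repeat(df['mesh_value'])]
         .assign(value_mesh_value=lambda d: d['value'] / d['mesh_value'])
         [['value_mesh_value', 'other_value']]
         .reset_index(drop=True))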

Filter a dataframe based on min values in one column by group in another column [duplicate]

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
   item  diff  otherstuff
0     1     2           1
1     1     1           2
2     1     3           7
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9
and should end up like:
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0
but what I'm getting is:
   item  diff
0     1     1
1     2    -6
2     3     0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print(df)
   item  diff  otherstuff
6     2    -6           2
7     3     0           0
1     1     1           2
If there can be multiple minimal values per group and you want all of the min rows, use boolean indexing with transform to get the minimum per group:
print(df)
   item  diff  otherstuff
0     1     2           1
1     1     1           2   <- multiple min
2     1     1           7   <- multiple min
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9
print(df.groupby("item")["diff"].transform('min'))
0    1
1    1
2    1
3   -6
4   -6
5   -6
6   -6
7    0
8    0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print(df)
   item  diff  otherstuff
1     1     1           2
2     1     1           7
6     2    -6           2
7     3     0           0
The answers above work well if there is (or you only want) a single min. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6,
                   'v1': (list(range(3)) + list(range(3))) * 2,
                   'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))
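For the example frame above this keeps every row whose v1 equals its group minimum (expected output, computed from the example data):
   g  v1  v2
0  a   0   0
3  a   0   3
6  b   0   6
9  b   0   9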
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's methods and couldn't get them to work properly. Instead I did the process step by step and ended up with the correct result.
df.sort_values(by='diff', inplace=True, ignore_index=True)
df.drop_duplicates(subset='item', inplace=True, ignore_index=True)
df.sort_values(by='item', inplace=True, ignore_index=True)
For a little more explanation:
Sort by the column whose minimum you want (diff)
Drop duplicates of the grouping column (item), which keeps the first, i.e. minimal, row per item
Resort by item, because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record, you can sort and then use duplicated, inverting the mask to keep the first (minimal) row per item:
df = df.sort_values(by='diff')
df = df[~df.duplicated(subset='item', keep='first')]

How to get average values from specific value type?

I have a dataframe:
id  value
a   0:3,1:0,2:0,3:4
a   0:0,1:0,2:2,3:0
a   0:0,1:5,2:4,3:0
How to get average values of keys in column value?
For example, for 0:3,1:0,2:0,3:4 it must be (0+0+0+3+3+3+3)/7 = 1.71.
For 0:0,1:0,2:2,3:0 it must be (2+2)/2 = 2.
For 0:0,1:5,2:4,3:0 it must be (1+1+1+1+1+2+2+2+2)/9 = 1.44.
So desired result is:
id  value
a    1.71
a    2.00
a    1.44
How to do that? These are not dictionary values, so I don't really understand how to operate with keys and values here.
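Each key:count pair contributes its key repeated count times, so the target is a weighted mean of the keys with the counts as weights. A minimal sketch that parses each string directly (weighted_avg is an illustrative helper, assuming df as in the question):
def weighted_avg(s):
    # parse '0:3,1:0,...' into (key, count) pairs
    pairs = [tuple(map(int, p.split(':'))) for p in s.split(',')]
    total = sum(k * c for k, c in pairs)  # sum of key * count
    count = sum(c for _, c in pairs)      # total number of occurrences
    return total / count

df['value'] = df['value'].map(weighted_avg)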
You could use:
g = (df.join(df['value'].str.split(',')
               .explode()
               .str.split(':', expand=True)
               .astype(int))
       .assign(value=lambda d: d[0] * d[1])
       .set_index('id', append=True)
       .groupby(level=[0, 1])
     )
g['value'].sum() / g[1].sum()
output:
id
0  a    1.7143
1  a    2.0000
2  a    1.4444
How it works:
1. Split the values and create new columns:
(df.join(df['value'].str.split(',')
           .explode()
           .str.split(':', expand=True)
           .astype(int))
   .assign(value=lambda d: d[0] * d[1])
   .set_index('id', append=True)
)
      value  0  1
  id
0 a       0  0  3
  a       0  1  0
  a       0  2  0
  a      12  3  4
1 a       0  0  0
  a       0  1  0
  a       4  2  2
  a       0  3  0
2 a       0  0  0
  a       5  1  5
  a       8  2  4
  a       0  3  0
2. Compute the sum of the values and the sum of the weights to get the mean:
>>> g['value'].sum()
id
0  a    12
1  a     4
2  a    13
>>> g[1].sum()
id
0  a     7
1  a     2
2  a     9
>>> g['value'].sum()/g[1].sum()
id
0  a    1.7143
1  a    2.0000
2  a    1.4444

Python Pandas - Selecting specific rows based on the max and min of two columns with the same group id

I am looking for a way to identify the row that is the 'master' row. The way I am defining the master row is: for each group_id, the row that has the minimum cust_hierarchy; if there is a tie, use the row with the most recent date.
I have supplied some sample tables below:
row_id  group_id  cust_hierarchy  most_recent_date  master (I am looking for)
     1         0               2        2020-01-03                           1
     2         0               7        2019-01-01                           0
     3         1               7        2019-05-01                           0
     4         1               6        2019-04-01                           0
     5         1               6        2019-04-03                           1
I was thinking of possibly ordering by the two columns (cust_hierarchy ascending, most_recent_date descending) and then adding a new column that places a 1 on the first row for each group_id?
Does anyone have any helpful code for this?
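For reference, the ordering idea sketched in the question can be implemented with cumcount (a sketch, assuming the column names from the table):
df = df.sort_values(['cust_hierarchy', 'most_recent_date'],
                    ascending=[True, False])
df['master'] = (df.groupby('group_id').cumcount() == 0).astype(int)
df = df.sort_index()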
You can basically do a groupby with idxmin(), with a little bit of sorting to ensure the most recent date is selected by the min operation:
import pandas as pd
import numpy as np

# example data
dates = ['2020-01-03', '2019-01-01', '2019-05-01',
         '2019-04-01', '2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id': [0, 0, 1, 1, 1],
                   'cust_hierarchy': [2, 7, 7, 6, 6],
                   'most_recent_date': dates})

# solution
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
df before:
   group_id  cust_hierarchy most_recent_date
0         0               2       2020-01-03
1         0               7       2019-01-01
2         1               7       2019-05-01
3         1               6       2019-04-01
4         1               6       2019-04-03
df after:
   group_id  cust_hierarchy most_recent_date  master
0         0               2       2020-01-03    True
1         0               7       2019-01-01   False
2         1               7       2019-05-01   False
3         1               6       2019-04-01   False
4         1               6       2019-04-03    True
Use duplicated on sort_values:
df['master'] = 1 - (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                   ascending=[False, True])
                      .duplicated('group_id', keep='last')
                      .astype(int))
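Applied to the example frame above, this reproduces the expected master flags (output computed from the sample data):
   group_id  cust_hierarchy most_recent_date  master
0         0               2       2020-01-03       1
1         0               7       2019-01-01       0
2         1               7       2019-05-01       0
3         1               6       2019-04-01       0
4         1               6       2019-04-03       1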
