Taking Differences of Records When Status Changes - Pandas - python

I have customer records with id, timestamp and status.
ID  TS   STATUS
1   10   GOOD
1   20   GOOD
1   25   BAD
1   30   BAD
1   50   BAD
1   600  GOOD
2   40   GOOD
..  ...
I am trying to calculate how much time is spent in consecutive BAD statuses (let's imagine the order above is correct) per customer. So for customer id=1 that is (30-25) + (50-30) + (600-50), i.e. 575 seconds in total spent in BAD status.
What is the method of doing this in Pandas? Calculating .diff() on TS would give me the differences, but how can I tie that 1) to the customer and 2) to particular status "blocks" for that customer?
Sample data:
df = pandas.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 2],
                       'TS': [10, 20, 25, 30, 50, 600, 40],
                       'Status': ['G', 'G', 'B', 'B', 'B', 'G', 'G']},
                      columns=['ID', 'TS', 'Status'])

In [1]: df = DataFrame({'ID': [1,1,1,1,1,2,2], 'TS': [10,20,25,30,50,10,40],
   ...:                 'Status': ['G','G','B','B','B','B','B']},
   ...:                columns=['ID','TS','Status'])
In [2]: f = lambda x: x.diff().sum()
In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['TS'].transform(f)
In [4]: df
Out[4]:
   ID  TS Status  diff
0   1  10      G   NaN
1   1  20      G   NaN
2   1  25      B    25
3   1  30      B    25
4   1  50      B    25
5   2  10      B    30
6   2  40      B    30
Explanation:
Subset the DataFrame to only those records with the desired Status. Group by ID and apply the lambda function diff().sum() to each group. Use transform instead of apply because transform returns a series indexed like the original DataFrame, which you can assign directly to the new column 'diff'.
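For contrast, a quick sketch of what apply would have returned on the same df: a series indexed by the group keys rather than by the original row labels, so it does not align for column assignment.
In [5]: df[df.Status=='B'].groupby('ID')['TS'].apply(f)
Out[5]:
ID
1    25
2    30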
EDIT: New response to account for expanded question scope.
In [1]: df
Out[1]:
   ID   TS Status
0   1   10      G
1   1   20      G
2   1   25      B
3   1   30      B
4   1   50      B
5   1  600      G
6   2   40      G
In [2]: df['shift'] = -df['TS'].diff(-1)
In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['shift'].transform('sum')
In [4]: df
Out[4]:
   ID   TS Status  shift  diff
0   1   10      G     10   NaN
1   1   20      G      5   NaN
2   1   25      B      5   575
3   1   30      B     20   575
4   1   50      B    550   575
5   1  600      G   -560   NaN
6   2   40      G    NaN   NaN
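Note that 'shift' here is computed across customer boundaries, which is where the -560 on the last ID 1 row comes from (600 minus customer 2's TS of 40). That row has status 'G' so it never enters the sum, but a grouped diff avoids the bleed entirely (a minimal variant of the same idea):
In [5]: df['shift'] = -df.groupby('ID')['TS'].diff(-1)
With this, the ID 1 row at TS 600 gets NaN instead of -560, and the 'diff' totals are unchanged.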

Here's a solution to separately aggregate each contiguous block of bad status (part 2 of your question?).
In [5]: df = pandas.DataFrame({'ID': [1,1,1,1,1,1,1,1,2,2,2],
   ...:                        'TS': [10,20,25,30,50,600,650,670,40,50,60],
   ...:                        'Status': ['G','G','B','B','B','G','B','B','G','B','B']},
   ...:                       columns=['ID','TS','Status'])
In [6]: grp = df.groupby('ID')
In [7]: def status_change(df):
   ...:     return (df.Status.shift(1) != df.Status).astype(int)
   ...:
In [8]: df['BlockId'] = grp.apply(lambda df: status_change(df).cumsum())
In [9]: df['Duration'] = grp.TS.diff().shift(-1)
In [10]: df
Out[10]:
    ID   TS Status  BlockId  Duration
0    1   10      G        1        10
1    1   20      G        1         5
2    1   25      B        2         5
3    1   30      B        2        20
4    1   50      B        2       550
5    1  600      G        3        50
6    1  650      B        4        20
7    1  670      B        4       NaN
8    2   40      G        1        10
9    2   50      B        2        10
10   2   60      B        2       NaN
In [11]: df[df.Status == 'B'].groupby(['ID', 'BlockId']).Duration.sum()
Out[11]:
ID  BlockId
1   2          575
    4           20
2   2           10
Name: Duration
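For reference, the same block aggregation can be written without apply, using a vectorized block counter (a sketch, not part of the original answer; BlockId here is a global counter rather than per-ID, which is fine because ID is also in the groupby):
In [12]: df['BlockId'] = df['Status'].ne(df['Status'].shift()).cumsum()
In [13]: df['Duration'] = -df.groupby('ID')['TS'].diff(-1)
In [14]: df[df['Status'] == 'B'].groupby(['ID', 'BlockId'])['Duration'].sum()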

Related

Pandas groupby filter only last two rows

I am working on a pandas manipulation and want to select only the two rows with the highest values of column "B" for each group in column "A".
How can I do this without reset_index and filter (i.e. inside the groupby)?
import pandas as pd
df = pd.DataFrame({
    'A': list('aaabbbbcccc'),
    'B': [0, 1, 2, 5, 7, 2, 1, 4, 1, 0, 2],
    'V': range(10, 120, 10)
})
df
My attempt
df.groupby(['A','B'])['V'].sum()
Required output
A  B
a  1     20
   2     30
b  5     40
   7     50
c  2    110
   4     80
IIUC, you want to get the rows with the two highest B per A.
You can compute a descending rank per group and keep those ≤ 2.
df[df.groupby('A')['B'].rank('first', ascending=False).le(2)]
Output:
    A  B    V
1   a  1   20
2   a  2   30
3   b  5   40
4   b  7   50
7   c  4   80
10  c  2  110
Try:
df.sort_values(['A', 'B']).groupby(['A']).tail(2)
Output:
    A  B    V
1   a  1   20
2   a  2   30
3   b  5   40
4   b  7   50
10  c  2  110
7   c  4   80
def function1(dd: pd.DataFrame):
    return dd.sort_values('B').iloc[-2:, 1:]

df.groupby(['A']).apply(function1).droplevel(1)
Output:
   B    V
A
a  1   20
a  2   30
b  5   40
b  7   50
c  2  110
c  4   80
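One more variant, if it helps: DataFrame.nlargest per group (a sketch; rows within each group come out in descending order of B):
df.groupby('A', group_keys=False).apply(lambda g: g.nlargest(2, 'B'))
Output:
    A  B    V
2   a  2   30
1   a  1   20
4   b  7   50
3   b  5   40
7   c  4   80
10  c  2  110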

How to group dataframe by column and receive new column for every group

I have the following dataframe:
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})
   timestamp  idx  v1  v2
0         10    1   1   1
1         10    2   2   2
2         10    3   4   8
3         20    1   5   5
4         20    2   1   1
5         20    3   9   2
I'd like to group the data by timestamp and calculate the following statistic:
np.sum(v1*v2) for every timestamp, broadcast to each of its rows. I'd like to see the following result:
   timestamp  idx  v1  v2  stat
0         10    1   1   1    37
1         10    2   2   2    37
2         10    3   4   8    37
3         20    1   5   5    44
4         20    2   1   1    44
5         20    3   9   2    44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But the stat column receives all NaN values - what is wrong in my code?
We want groupby transform here, not groupby apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function, we need to join back to scale up the aggregated DataFrame:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # needed to use join, but also sets the col name
    on='timestamp'
)
df:
   timestamp  idx  v1  v2  stat
0         10    1   1   1    37
1         10    2   2   2    37
2         10    3   4   8    37
3         20    1   5   5    44
4         20    2   1   1    44
5         20    3   9   2    44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally, as there are only 2 rows where the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this by using groupby transform, which is designed to produce a:
like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values
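As a further alternative (a sketch along the same lines as the join): map the 2-row summary back onto the timestamp column, since Series.map accepts a Series as the lookup table.
stat_per_ts = df.groupby('timestamp').apply(calc_some_stat)
df['stat'] = df['timestamp'].map(stat_per_ts)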

Pandas : Apply weights to another column, for certain ids only

Let's take this sample dataframe and this list of ids:
df=pd.DataFrame({'Id':['A','A','A','B','C','C','D','D'], 'Weight':[50,20,30,1,2,8,3,2], 'Value':[100,100,100,10,20,20,30,30]})
  Id  Weight  Value
0  A      50    100
1  A      20    100
2  A      30    100
3  B       1     10
4  C       2     20
5  C       8     20
6  D       3     30
7  D       2     30
L = ['A','C']
The Value column has the same value for each id in the Id column. For the specific ids of L, I would like to apply the weights of the Weight column to the Value column. I am currently doing it the following way, but it is extremely slow with my real, big dataframe:
for i in L:
    df.loc[df["Id"] == i, "Value"] = (df.loc[df["Id"] == i, "Value"] * df.loc[df["Id"] == i, "Weight"]
                                      / df[df["Id"] == i]["Weight"].sum())
How could I do that efficiently, please?
Expected output :
  Id  Weight  Value
0  A      50     50
1  A      20     20
2  A      30     30
3  B       1     10
4  C       2      4
5  C       8     16
6  D       3     30
7  D       2     30
The idea is to work only on the rows filtered by Series.isin, using GroupBy.transform('sum') to get per-group sums as a series with the same length as the filtered DataFrame:
L = ['A','C']
m = df['Id'].isin(L)
df1 = df[m].copy()
s = df1.groupby('Id')['Weight'].transform('sum')
df.loc[m, 'Value'] = df1['Value'].mul(df1['Weight']).div(s)
print (df)
  Id  Weight  Value
0  A      50   50.0
1  A      20   20.0
2  A      30   30.0
3  B       1   10.0
4  C       2    4.0
5  C       8   16.0
6  D       3   30.0
7  D       2   30.0
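A variant without the intermediate copy, if you prefer (a sketch; group sums are computed for all Ids, and the original Value is kept wherever the mask is False):
import numpy as np

m = df['Id'].isin(L)
s = df.groupby('Id')['Weight'].transform('sum')
df['Value'] = np.where(m, df['Value'] * df['Weight'] / s, df['Value'])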

Pandas Merge Columns with Priority

My input dataframe:
   MinA  MinB  MaxA  MaxB
0   1.0   2.0   5.0   7.0
1   1.0   0.0   8.0   6.0
2   2.0   NaN  15.0  15.0
3   NaN   3.0   NaN   NaN
4   NaN   NaN   NaN  10.0
I want to merge the "min" and "max" columns amongst themselves, with priority (the A columns take priority over the B columns).
If both columns are null, they should get default values: 0 for Min and 100 for Max.
Desired output is:
   MinA  MinB  MaxA  MaxB  Min  Max
0   1.0   2.0   5.0   7.0    1    5
1   1.0   0.0   8.0   6.0    1    8
2   2.0   NaN  15.0  15.0    2   15
3   NaN   3.0   NaN   NaN    3  100
4   NaN   NaN   NaN  10.0    0   10
Could you please help me with this?
This can be accomplished using mask. With your data that would look like the following:
df = pd.DataFrame({
    'MinA': [1, 1, 2, None, None],
    'MinB': [2, 0, None, 3, None],
    'MaxA': [5, 8, 15, None, None],
    'MaxB': [7, 6, 15, None, 10],
})
# Create the new column using A as the base; where it is NaN, use B.
# Then do the same again with the specified default values.
df['Min'] = df['MinA'].mask(pd.isna, df['MinB']).mask(pd.isna, 0)
df['Max'] = df['MaxA'].mask(pd.isna, df['MaxB']).mask(pd.isna, 100)
The above would result in the desired output:
   MinA  MinB  MaxA  MaxB  Min  Max
0     1     2     5     7    1    5
1     1     0     8     6    1    8
2     2   NaN    15    15    2   15
3   NaN     3   NaN   NaN    3  100
4   NaN   NaN   NaN    10    0   10
Just using fillna() will be fine:
df['Min'] = df['MinA'].fillna(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].fillna(df['MaxB']).fillna(100)

Sort pandas grouped cols in line

I am working on a machine learning task and I want to change each row from "numbered objects" to "objects sorted by some attributes".
For example, I have 5 heroes in 2 teams, represented by their stats (dN_%stat% and rN_%stat%), and what I want is to sort the heroes in each team by the stats numbered 3, 4, 0 and 2, so that the first one is the strongest, and so on.
Here is my current code, but it is very slow, so I want to use native pandas objects and operations:
def sort_heroes(df):
    for match_id in df.index:
        for team in ['r', 'd']:
            heroes = []
            for n in range(1, 6):
                heroes.append(
                    [df.ix[match_id, '%s%s_%s' % (team, n, stat)]
                     for stat in stats])
            heroes.sort(key=lambda x: (x[3], x[4], x[0], x[2]))
            for n in range(1, 6):
                for i, stat in enumerate(stats):
                    df.ix[match_id, '%s%s_%s' %
                          (team, n, stat)] = heroes[n - 1][i]
A short example with partial but representative data:
match_id  r1_xp  r1_gold  r2_xp  r2_gold  r3_xp  r3_gold  d1_xp  d1_gold  d2_xp  d2_gold
       1     10       20    100       10   5000      300      0        0     15        5
       2      1        1   1000       80    100       13    200       87    311       67
What I want is to sort those column groups (prefixes rN_ and dN_) first by gold and then by xp:
match_id  r1_xp  r1_gold  r2_xp  r2_gold  r3_xp  r3_gold  d1_xp  d1_gold  d2_xp  d2_gold
       1   5000      300     10       20    100       10     15        5      0        0
       2   1000       80    100       13      1        1    200       87    311       67
You can use:
df.set_index('match_id', inplace=True)

# create a MultiIndex with 3 levels
arr = df.columns.str.extract(r'([rd])(\d*)_(.*)', expand=True).T.values
df.columns = pd.MultiIndex.from_arrays(arr)

# reshape df, sorting by match, team, gold and xp
df = df.stack([0,1]).reset_index().sort_values(['match_id','level_1','gold','xp'],
                                               ascending=[True,False,False,False])
print (df)
   match_id level_1 level_2   gold      xp
4         1       r       3  300.0  5000.0
2         1       r       1   20.0    10.0
3         1       r       2   10.0   100.0
1         1       d       2    5.0    15.0
0         1       d       1    0.0     0.0
8         2       r       2   80.0  1000.0
9         2       r       3   13.0   100.0
7         2       r       1    1.0     1.0
5         2       d       1   87.0   200.0
6         2       d       2   67.0   311.0
# assign new values to level 2
df.level_2 = df.groupby(['match_id','level_1']).cumcount().add(1).astype(str)

# restore the original shape
df = df.set_index(['match_id','level_1','level_2']).stack().unstack([1,2,3]).astype(int)
df = df.sort_index(level=[0,1,2], ascending=[False, True, False], axis=1)

# flatten the MultiIndex columns to plain column names
df.columns = ['{}{}_{}'.format(x[0], x[1], x[2]) for x in df.columns]
df.reset_index(inplace=True)
print (df)
   match_id  r1_xp  r1_gold  r2_xp  r2_gold  r3_xp  r3_gold  d1_xp  d1_gold  \
0         1   5000      300     10       20    100       10     15        5
1         2   1000       80    100       13      1        1    200       87

   d2_xp  d2_gold
0      0        0
1    311       67
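If pure speed is the goal, a numpy-based sketch applied to the original wide frame may also be worth a look (a hypothetical helper, not from the answer above; it assumes the stats are non-negative and that xp stays below the 1e6 used to build the composite sort key):
import numpy as np

def sort_team(df, team, n_heroes, stats=('xp', 'gold')):
    # column blocks laid out as xp, gold per hero, matching the sample data
    cols = ['%s%d_%s' % (team, n, s) for n in range(1, n_heroes + 1) for s in stats]
    vals = df[cols].to_numpy(dtype=float).reshape(len(df), n_heroes, len(stats))
    key = vals[:, :, 1] * 1e6 + vals[:, :, 0]        # gold primary, xp as tiebreaker
    order = np.argsort(-key, axis=1, kind='stable')  # descending within each row
    df[cols] = np.take_along_axis(vals, order[:, :, None], axis=1).reshape(len(df), -1)

for team, n in [('r', 3), ('d', 2)]:
    sort_team(df, team, n)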
