CumSum column only if previous ids are between 2 values - python

Consider I have the below:
Dataframe:
id endId startId ownerId value
1 50 50 10 105
2 51 50 10 240
3 52 50 10 420
4 53 53 10 470
5 40 40 11 320
6 41 40 11 18
7 55 55 12 50
8 57 55 12 412
9 59 55 12 398
10 60 57 12 320
What I would like to do is to sum all the "value" columns where the endId is between the current startId and the current endId for the same ownerId.
Output should be:
id endId startId ownerId value output
1 50 50 10 105 105 # Nothing between 50 and 50
2 51 50 10 240 345 # Found 1 record (endId with id 1)
3 52 50 10 420 765 # Found 2 records (endId with id 1 and 2)
4 53 53 10 470 470 # Nothing else between 53 and 53
5 40 40 11 320 320 # Reset because Owner is different
6 41 40 11 18 338 # Found 1 record (endId with id 5)
7 55 55 12 50 50 # ...
8 57 55 12 412 462
9 59 55 12 398 860
10 60 57 12 320 1130 # Found 3 records between 57 and 60 (endId with id 8, 9 and 10)
I tried to use diff, groupby.cumsum, etc. but I cannot get what I need...

I would use numpy broadcasting to identify the rows that you are looking for:
# Create new df with ownerId as index
df2=df.set_index('ownerId')
df2['output']=0
# Loop over the various ownerIds
for k in df2.index:
refend=df2.loc[k,'endId'].values
refstart=df2.loc[k,'startId'].values
# Identify values matching the condition
i,j=np.where((refend[:,None]<=refend)&(refend[:,None]>=refstart))
# Groupby and sum
dfres=pd.concat([df2.loc[k].iloc[j].endId.reset_index(drop=True),
df2.loc[k].iloc[i].value.reset_index(drop=True)],
axis=1).groupby('endId').sum()
df2.loc[k,'output']=dfres.value.values
# reset index
df2.reset_index(inplace=True)
the output is:
ownerId id endId startId value output
0 10 1 50 50 105 105
1 10 2 51 50 240 345
2 10 3 52 50 420 765
3 10 4 53 53 470 470
4 11 5 40 40 320 320
5 11 6 41 40 18 338
6 12 7 55 55 50 50
7 12 8 57 55 412 462
8 12 9 59 55 398 860
9 12 10 60 57 320 1130
Edit
You can avoid the avoid the for-loop with the following:
refend=df.loc[:,'endId'].values
refstart=df.loc[:,'startId'].values
i,j=np.where((refend[:,None]<=refend)&(refend[:,None]>=refstart))
dfres=pd.concat([df.iloc[j].endId.reset_index(drop=True),
df.loc[:,['ownerId','value']].iloc[i].reset_index(drop=True)],
axis=1).groupby(['ownerId','endId']).sum()
df['output']=dfres.value.values

I made a copy of df to df2, to keep the original data.
I suggest you to break the task in two steps:
#change everything
df2['output'] = df.groupby('ownerId')['value'].cumsum()
#check and update if it applies
df2['output'] = np.where((df2['endId']<= df['startId']),
df2['value'], #copy value from
df2['output']) #place value into
print(df2)
id endId startId ownerId value output
0 1 50 50 10 105 105
1 2 51 50 10 240 345
2 3 52 50 10 420 765
3 4 53 53 10 470 470
4 5 40 40 11 320 320
5 6 41 40 11 18 338
6 7 55 55 12 50 50
7 8 57 55 12 412 462
8 9 59 55 12 398 860
9 10 60 57 12 320 1180
Print of the logic:
I am sorry people, but I still don't get it.
For ownerId 10 and 11 the record where endId and startId are sharing the same value is being counted on the accumulative sum.
And it seems to be ok. But for some reason you are saying that the same rule doesn't apply to ownerId 12.
I understand that id from 7 to 10 should be considered. The pattern seems to be to not count the values when endId and startId
matches on the highest value, it happens on id 4.

Related

How to group and sum rows by ID and subtract from group of rows with same ID? [python]

I have the following dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum
-----------------------------------------------
0 22 5 1 54 208
1 23 5 2 34 208
2 24 6 1 44 268
3 25 6 1 64 268
4 26 5 2 35 208
5 27 7 3 45 229
6 28 7 2 66 229
7 29 8 1 76 161
8 30 8 2 25 161
9 31 6 2 27 268
10 32 5 3 14 208
11 33 5 3 17 208
12 34 6 2 43 268
13 35 6 2 53 268
14 36 8 1 22 161
15 37 7 3 65 229
16 38 7 1 53 229
17 39 8 2 23 161
18 40 8 3 15 161
19 41 6 3 37 268
20 42 5 2 54 208
Each row contains a unique "ID_A", while different rows can have the same "ID_B" and "ID_C". Each row corresponds to its own "Value", where this "Value" can be the same between different rows. The "ID_B_Value_Sum" column contains the sums of all values from the "Value" column for all rows containing the same "ID_B". Calculating this sum is straightforward with python and pandas.
What I want to do is, for each row, take the "ID_B_Value_Sum" column, but subtract all values corresponding to rows containing the same "ID_C", exclusive of the target row. For example, taking "ID_B" = 6, we see the sum of all the "Value" values from this "ID_B" = 6 group = 268, as shown in all corresponding rows in the "ID_B_Value_Sum" column. Now, two of the rows in this group contain "ID_C" = 1, three rows in this group contain "ID_C" = 2, and one row in this group contain "ID_C" = 3. Starting with row 2, with "ID_C" = 1, this means taking the corresponding "ID_B_Value_Sum" value and subtracting the "Value" values from all other rows containing both "ID_B" = 6 and "ID_C = 1", exclusive of the target row. And so for row 2 I take 268 - 64 = 204. And for another example, for row 4, this means 208 - 34 - 54 = 120. And another example, for row 7, this means 161 - 22 = 139. These new values will go in a new "Value_Sum_New" column for each row.
And so I want to produce the following output dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
---------------------------------------------------------------
0 22 5 1 54 208 XX
1 23 5 2 34 208 XX
2 24 6 1 44 268 204
3 25 6 1 64 268 XX
4 26 5 2 35 208 120
5 27 7 3 45 229 XX
6 28 7 2 66 229 XX
7 29 8 1 76 161 139
8 30 8 2 25 161 XX
9 31 6 2 27 268 XX
10 32 5 3 14 208 XX
11 33 5 3 17 208 XX
12 34 6 2 43 268 XX
13 35 6 2 53 268 XX
14 36 8 1 22 161 XX
15 37 7 3 65 229 XX
16 38 7 1 53 229 XX
17 39 8 2 23 161 XX
18 40 8 3 15 161 XX
19 41 6 3 37 268 XX
20 42 5 2 54 208 XX
What I am having trouble with conceptualizing is how to, for each row, group together all columns with the same "ID_B" and then group together all of those rows and sub-group all rows with the same "ID_C" and subtract their sum from the "Value" of the target row, but still including the "Value" from the target row, to create the final "Value_Sum_New". It seems like so many actions and sub-actions to take and I am confused with how to approach this in a simple and streamlined manner, as I am confused with how to organize and order the workflow. How might I approach calculating this sum in python?
IIUC, you need:
df['Value_Sum_New'] = (df['ID_B_Value_Sum']
- df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
+ df['Value']
)
output:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
0 22 5 1 54 208 208
1 23 5 2 34 208 119
2 24 6 1 44 268 204
3 25 6 1 64 268 224
4 26 5 2 35 208 120
5 27 7 3 45 229 164
6 28 7 2 66 229 229
7 29 8 1 76 161 139
8 30 8 2 25 161 138
9 31 6 2 27 268 172
10 32 5 3 14 208 191
11 33 5 3 17 208 194
12 34 6 2 43 268 188
13 35 6 2 53 268 198
14 36 8 1 22 161 85
15 37 7 3 65 229 184
16 38 7 1 53 229 229
17 39 8 2 23 161 136
18 40 8 3 15 161 161
19 41 6 3 37 268 268
20 42 5 2 54 208 139
explanation
As you said, computing a sum per group is easy in pandas. You can actually compute ID_B_Value_Sum with:
df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')
Now we do the same for groups of ID_B + ID_C, we subtract it from ID_B_Value_Sum, and as we want to exclude only the other rows in the group, we add back the row Value itself.

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of the columns only where the columns match a string (in this case, all columns with _CAP at the end of their name). And store the sum of the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, the solution doesn't work for me since they are summing up columns that have the same exact name so a simple groupby can accomplish the result whereas I am trying to sum columns with specific string matches only.
Code to recreate above sample dataset:
data1 = [['1', 33,25,16,50], ['2', 34,22,11,66],
['3', 22,12,15,42],['4', 46,45,66,54],
['5',16,6,23,75], ['6', 21,42,433,50]]
df = pd.DataFrame(data1, columns = ['ID', 'Length','Width','Range_CAP','Capacity_CAP'])
Let us do filter
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If have other CAP in front
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask where the values are True if and only if the column name ends with CAP. As an alternative use filter, with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
if i.find('_CAP') != -1:
df['sum'] = df['sum'] + df[i]
else:
pass

Selecting rows which match condition of group

I have a Pandas DataFrame df which looks as follows:
ID Timestamp x y
1 10 322 222
1 12 234 542
1 14 22 523
2 55 222 76
2 56 23 87
2 58 322 5436
3 100 322 345
3 150 22 243
3 160 12 765
3 170 78 65
Now, I would like to keep all rows where the timestamp is between 12 and 155. This I could do by df[df["timestamp"] >= 12 & df["timestamp"] <= 155]. But I would like to have only rows included where all timestamps in the corresponding ID group are within the range. So in the example above it should result in the following dataframe:
ID Timestamp x y
2 55 222 76
2 56 23 87
2 58 322 5436
For ID == 1 and ID == 3 not all timestamps of the rows are in the range that's why they are not included.
How can this be done?
You can combine groupby("ID") and filter:
df.groupby("ID").filter(lambda x: x.Timestamp.between(12, 155).all())
ID Timestamp x y
3 2 55 222 76
4 2 56 23 87
5 2 58 322 5436
Use transform with groupby and using all() to check if all items in the group matches the condition:
df[df.groupby('ID').Timestamp.transform(lambda x: x.between(12,155).all())]
ID Timestamp x y
3 2 55 222 76
4 2 56 23 87
5 2 58 322 5436

Pandas pivot table with mean

I have a pandas data frame, df, that looks like this;
index New Old MAP Limit count
1 93 35 54 > 18 1
2 163 93 116 > 18 1
3 134 78 96 > 18 1
4 117 81 93 > 18 1
5 194 108 136 > 18 1
6 125 57 79 <= 18 1
7 66 39 48 > 18 1
8 120 83 95 > 18 1
9 150 98 115 > 18 1
10 149 99 115 > 18 1
11 148 85 106 > 18 1
12 92 55 67 <= 18 1
13 64 24 37 > 18 1
14 84 53 63 > 18 1
15 99 70 79 > 18 1
I need to create a pivot table that looks like this
Limit <=18 >18
New xx1 xx2
Old xx3 xx4
MAP xx5 xx6
where values xx1, xx2, xx3, xx4, xx5, and xx6 are the mean of New, Old and Map for respective Limit.
How can I achieve this?
I tried the following without success.
table = df.pivot_table('count', index=['New', 'Old', 'MAP'], columns=['Limit'], aggfunc='mean')
Solution
df.groupby('Limit')['New', 'Old', 'MAP'].mean().T
Limit <= 18 > 18
New 108.5 121.615385
Old 56.0 72.769231
MAP 73.0 88.692308

Issue combining columns in Dataframe?

I have the following dataframe:
Obj BIT BIT BIT GAS GAS GAS OIL OIL OIL
Date
2007-01-03 18 7 0 184 35 2 52 14 0
2007-01-09 43 3 0 249 35 2 68 11 1
2007-01-16 60 6 0 254 35 5 72 13 1
2007-01-23 69 11 1 255 43 2 81 6 0
2007-01-30 74 8 0 263 29 4 69 9 0
2007-02-06 78 6 1 259 34 2 79 6 0
2007-02-14 76 9 1 263 24 2 70 10 1
2007-02-20 85 7 0 241 20 6 72 4 0
2007-02-27 79 6 0 242 35 3 68 7 0
2007-03-06 68 14 0 225 26 2 57 10 1
How can I sum each of the 9 columns into 3 columns. "BIT","GAS" and "OIL"
This is the code for the dataframe which basically just gets me a cross section from a larger df I want:
ABrigsA = ndfAB.xs(['BIT','GAS','OIL'],axis=1)
Any suggestions?
Assuming that you want to sum similarly-named columns, you can use groupby [tutorial docs]:
>>> df.groupby(level=0, axis='columns').sum()
Obj BIT GAS OIL
Date
2007-01-03 25 221 66
2007-01-09 46 286 80
2007-01-16 66 294 86
2007-01-23 81 300 87
2007-01-30 82 296 78
2007-02-06 85 295 85
2007-02-14 86 289 81
2007-02-20 92 267 76
2007-02-27 85 280 75
2007-03-06 82 253 68

Categories

Resources