I have the following dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum
-----------------------------------------------
0 22 5 1 54 208
1 23 5 2 34 208
2 24 6 1 44 268
3 25 6 1 64 268
4 26 5 2 35 208
5 27 7 3 45 229
6 28 7 2 66 229
7 29 8 1 76 161
8 30 8 2 25 161
9 31 6 2 27 268
10 32 5 3 14 208
11 33 5 3 17 208
12 34 6 2 43 268
13 35 6 2 53 268
14 36 8 1 22 161
15 37 7 3 65 229
16 38 7 1 53 229
17 39 8 2 23 161
18 40 8 3 15 161
19 41 6 3 37 268
20 42 5 2 54 208
Each row contains a unique "ID_A", while different rows can have the same "ID_B" and "ID_C". Each row corresponds to its own "Value", where this "Value" can be the same between different rows. The "ID_B_Value_Sum" column contains the sums of all values from the "Value" column for all rows containing the same "ID_B". Calculating this sum is straightforward with python and pandas.
What I want to do is, for each row, take the "ID_B_Value_Sum" value and subtract the "Value" of every other row that has the same "ID_B" and the same "ID_C", exclusive of the target row. For example, taking "ID_B" = 6, the sum of all "Value" entries in this group is 268, as shown in all corresponding rows of the "ID_B_Value_Sum" column. Two of the rows in this group contain "ID_C" = 1, three rows contain "ID_C" = 2, and one row contains "ID_C" = 3. Starting with row 2, with "ID_C" = 1, this means taking the corresponding "ID_B_Value_Sum" value and subtracting the "Value" of all other rows containing both "ID_B" = 6 and "ID_C" = 1, exclusive of the target row. So for row 2 I take 268 - 64 = 204. As another example, for row 4 this means 208 - 34 - 54 = 120. And for row 7, this means 161 - 22 = 139. These new values will go in a new "Value_Sum_New" column for each row.
And so I want to produce the following output dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
---------------------------------------------------------------
0 22 5 1 54 208 XX
1 23 5 2 34 208 XX
2 24 6 1 44 268 204
3 25 6 1 64 268 XX
4 26 5 2 35 208 120
5 27 7 3 45 229 XX
6 28 7 2 66 229 XX
7 29 8 1 76 161 139
8 30 8 2 25 161 XX
9 31 6 2 27 268 XX
10 32 5 3 14 208 XX
11 33 5 3 17 208 XX
12 34 6 2 43 268 XX
13 35 6 2 53 268 XX
14 36 8 1 22 161 XX
15 37 7 3 65 229 XX
16 38 7 1 53 229 XX
17 39 8 2 23 161 XX
18 40 8 3 15 161 XX
19 41 6 3 37 268 XX
20 42 5 2 54 208 XX
What I am having trouble conceptualizing is how to, for each row, group together all rows with the same "ID_B", then within that group sub-group all rows with the same "ID_C" and subtract their sum from "ID_B_Value_Sum", while still including the "Value" of the target row, to create the final "Value_Sum_New". It seems like a lot of actions and sub-actions, and I am confused about how to organize and order the workflow in a simple and streamlined manner. How might I approach calculating this sum in python?
IIUC, you need:
df['Value_Sum_New'] = (df['ID_B_Value_Sum']
                       - df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
                       + df['Value']
                       )
output:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
0 22 5 1 54 208 208
1 23 5 2 34 208 119
2 24 6 1 44 268 204
3 25 6 1 64 268 224
4 26 5 2 35 208 120
5 27 7 3 45 229 164
6 28 7 2 66 229 229
7 29 8 1 76 161 139
8 30 8 2 25 161 138
9 31 6 2 27 268 172
10 32 5 3 14 208 191
11 33 5 3 17 208 194
12 34 6 2 43 268 188
13 35 6 2 53 268 198
14 36 8 1 22 161 85
15 37 7 3 65 229 184
16 38 7 1 53 229 229
17 39 8 2 23 161 136
18 40 8 3 15 161 161
19 41 6 3 37 268 268
20 42 5 2 54 208 139
explanation
As you said, computing a sum per group is easy in pandas. You can actually compute ID_B_Value_Sum with:
df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')
Now we do the same for groups of ID_B + ID_C, we subtract it from ID_B_Value_Sum, and as we want to exclude only the other rows in the group, we add back the row Value itself.
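To check the formula against the three worked examples in the question (rows 2, 4 and 7), here is a minimal self-contained sketch. It rebuilds the dataframe from the question; the intermediate names b_sum and bc_sum are only illustrative.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "ID_A": range(22, 43),
    "ID_B": [5, 5, 6, 6, 5, 7, 7, 8, 8, 6, 5, 5, 6, 6, 8, 7, 7, 8, 8, 6, 5],
    "ID_C": [1, 2, 1, 1, 2, 3, 2, 1, 2, 2, 3, 3, 2, 2, 1, 3, 1, 2, 3, 3, 2],
    "Value": [54, 34, 44, 64, 35, 45, 66, 76, 25, 27, 14,
              17, 43, 53, 22, 65, 53, 23, 15, 37, 54],
})

# Sum of Value over each ID_B group, broadcast back to every row
b_sum = df.groupby("ID_B")["Value"].transform("sum")
# Sum of Value over each (ID_B, ID_C) sub-group, broadcast back to every row
bc_sum = df.groupby(["ID_B", "ID_C"])["Value"].transform("sum")

df["ID_B_Value_Sum"] = b_sum
# Remove the whole sub-group, then add the row's own Value back
df["Value_Sum_New"] = b_sum - bc_sum + df["Value"]

print(df.loc[[2, 4, 7], ["ID_B", "ID_C", "Value", "Value_Sum_New"]])
```

Rows 2, 4 and 7 come out as 204, 120 and 139, matching the hand calculations in the question.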
I have two dataframes like df1, df2 below.
I would like to:
filter df1: that is, remove rows and columns so that it has the same index elements and columns as df2. The values in the rows and columns that are kept should not be modified.
In addition, I would like to order the rows and columns of this 'filtered' dataframe so that they appear in the same order as in df2.
Dataframe df1 is:
index  x_3  x_1  x_2
10     110  126  112
11     131  140  143
12     130  128  116
13     118  150  125
14     102  117  110
15     103  105  148
16     116  114  114
17     120  132  110
..and a second dataframe (df2) like:
index  x_1  x_2  x_3
10     1    1    5
11     4    1    2
14     2    2    4
15     1    2    1
16     2    4    1
The final result would be df3, that is:
index  x_1  x_2  x_3
10     126  112  110
11     140  143  131
14     117  110  102
15     105  148  103
16     114  114  116
Any insights?
You can use .reindex_like to conform the index and columns of df1 according to index and columns of df2:
df3 = df1.reindex_like(df2)
>>> df3
x_1 x_2 x_3
index
10 126 112 110
11 140 143 131
14 117 110 102
15 105 148 103
16 114 114 116
df1.loc[df2.index, df2.columns]
Shubham's answer is quite pythonic. Using loc is also a simple way to go from first principles.
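For comparison, a small self-contained sketch that runs both approaches on the data from the question. The names via_reindex and via_loc are just illustrative. One difference to keep in mind: if df2 contained labels missing from df1, reindex_like would fill them with NaN, whereas in recent pandas versions loc would raise a KeyError.

```python
import pandas as pd

# df1 and df2 rebuilt from the question
df1 = pd.DataFrame(
    {"x_3": [110, 131, 130, 118, 102, 103, 116, 120],
     "x_1": [126, 140, 128, 150, 117, 105, 114, 132],
     "x_2": [112, 143, 116, 125, 110, 148, 114, 110]},
    index=pd.Index(range(10, 18), name="index"),
)
df2 = pd.DataFrame(
    {"x_1": [1, 4, 2, 1, 2],
     "x_2": [1, 1, 2, 2, 4],
     "x_3": [5, 2, 4, 1, 1]},
    index=pd.Index([10, 11, 14, 15, 16], name="index"),
)

# Conform df1 to df2's row and column labels (and their order)
via_reindex = df1.reindex_like(df2)
# Same selection by explicit label lookup
via_loc = df1.loc[df2.index, df2.columns]

print(via_reindex)
print(via_loc)
```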
Consider I have the below:
Dataframe:
id endId startId ownerId value
1 50 50 10 105
2 51 50 10 240
3 52 50 10 420
4 53 53 10 470
5 40 40 11 320
6 41 40 11 18
7 55 55 12 50
8 57 55 12 412
9 59 55 12 398
10 60 57 12 320
What I would like to do is sum the "value" column over all rows whose endId is between the current row's startId and endId (inclusive), for the same ownerId.
Output should be:
id endId startId ownerId value output
1 50 50 10 105 105 # Nothing between 50 and 50
2 51 50 10 240 345 # Found 1 record (endId with id 1)
3 52 50 10 420 765 # Found 2 records (endId with id 1 and 2)
4 53 53 10 470 470 # Nothing else between 53 and 53
5 40 40 11 320 320 # Reset because Owner is different
6 41 40 11 18 338 # Found 1 record (endId with id 5)
7 55 55 12 50 50 # ...
8 57 55 12 412 462
9 59 55 12 398 860
10 60 57 12 320 1130 # Found 3 records between 57 and 60 (endId with id 8, 9 and 10)
I tried to use diff, groupby.cumsum, etc. but I cannot get what I need...
I would use numpy broadcasting to identify the rows that you are looking for:
import numpy as np
import pandas as pd

# Create new df with ownerId as index
df2 = df.set_index('ownerId')
df2['output'] = 0
# Loop over the distinct ownerIds
for k in df2.index.unique():
    refend = df2.loc[k, 'endId'].values
    refstart = df2.loc[k, 'startId'].values
    # Identify values matching the condition:
    # row i contributes to row j when startId[j] <= endId[i] <= endId[j]
    i, j = np.where((refend[:, None] <= refend) & (refend[:, None] >= refstart))
    # Groupby and sum
    dfres = pd.concat([df2.loc[k].iloc[j].endId.reset_index(drop=True),
                       df2.loc[k].iloc[i].value.reset_index(drop=True)],
                      axis=1).groupby('endId').sum()
    df2.loc[k, 'output'] = dfres.value.values
# reset index
df2.reset_index(inplace=True)
the output is:
ownerId id endId startId value output
0 10 1 50 50 105 105
1 10 2 51 50 240 345
2 10 3 52 50 420 765
3 10 4 53 53 470 470
4 11 5 40 40 320 320
5 11 6 41 40 18 338
6 12 7 55 55 50 50
7 12 8 57 55 412 462
8 12 9 59 55 398 860
9 12 10 60 57 320 1130
Edit
You can avoid the for-loop with the following:
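refend = df['endId'].values
refstart = df['startId'].values
owner = df['ownerId'].values
# Row i contributes to row j only when both rows share the same ownerId
i, j = np.where((refend[:, None] <= refend)
                & (refend[:, None] >= refstart)
                & (owner[:, None] == owner))
dfres = pd.concat([df.iloc[j].endId.reset_index(drop=True),
                   df.loc[:, ['ownerId', 'value']].iloc[i].reset_index(drop=True)],
                  axis=1).groupby(['ownerId', 'endId']).sum()
# Assumes df is sorted by ownerId, then endId, with unique endId per owner
df['output'] = dfres.value.values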
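As a cross-check of the broadcasting idea, here is a minimal sketch of a variant that builds one boolean mask per ownerId and sums the matching values directly, without the concat/groupby step. This is a swapped-in formulation rather than the code above; the names out and mask are illustrative. It does not rely on the rows being sorted by endId.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "id": range(1, 11),
    "endId": [50, 51, 52, 53, 40, 41, 55, 57, 59, 60],
    "startId": [50, 50, 50, 53, 40, 40, 55, 55, 55, 57],
    "ownerId": [10, 10, 10, 10, 11, 11, 12, 12, 12, 12],
    "value": [105, 240, 420, 470, 320, 18, 50, 412, 398, 320],
})

out = pd.Series(0, index=df.index)
for _, g in df.groupby("ownerId"):
    end = g["endId"].to_numpy()
    start = g["startId"].to_numpy()
    val = g["value"].to_numpy()
    # mask[j, i] is True when row i's endId lies in [startId_j, endId_j]
    mask = (end[None, :] >= start[:, None]) & (end[None, :] <= end[:, None])
    # For each target row j, add up the values of the rows i it covers
    out.loc[g.index] = (mask * val).sum(axis=1)

df["output"] = out
print(df)
```

On the question's data this gives 1130 for id 10, in line with the expected output.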
I made a copy of df as df2, to keep the original data.
I suggest breaking the task into two steps:
import numpy as np
df2 = df.copy()
# change everything: running total of value within each ownerId
df2['output'] = df.groupby('ownerId')['value'].cumsum()
# check and update where it applies
df2['output'] = np.where((df2['endId'] <= df['startId']),
                         df2['value'],   # copy value from
                         df2['output'])  # otherwise keep the running total
print(df2)
id endId startId ownerId value output
0 1 50 50 10 105 105
1 2 51 50 10 240 345
2 3 52 50 10 420 765
3 4 53 53 10 470 470
4 5 40 40 11 320 320
5 6 41 40 11 18 338
6 7 55 55 12 50 50
7 8 57 55 12 412 462
8 9 59 55 12 398 860
9 10 60 57 12 320 1180
I am sorry, but I still don't follow the expected output.
For ownerId 10 and 11, the record where endId and startId share the same value is included in the cumulative sum, and that seems fine.
But for some reason you are saying that the same rule doesn't apply to ownerId 12, where my result for id 10 is 1180 rather than 1130.
As I understand it, ids 7 to 10 should all be considered. The pattern seems to be that values are not accumulated when endId and startId match on the highest value, as happens for id 4.
I have a Pandas DataFrame df which looks as follows:
ID Timestamp x y
1 10 322 222
1 12 234 542
1 14 22 523
2 55 222 76
2 56 23 87
2 58 322 5436
3 100 322 345
3 150 22 243
3 160 12 765
3 170 78 65
Now, I would like to keep all rows where the timestamp is between 12 and 155. This I could do with df[(df["Timestamp"] >= 12) & (df["Timestamp"] <= 155)]. But I would like to include only rows for which all timestamps in the corresponding ID group are within the range. So in the example above it should result in the following dataframe:
ID Timestamp x y
2 55 222 76
2 56 23 87
2 58 322 5436
For ID == 1 and ID == 3, not all timestamps of the rows are in the range, which is why they are not included.
How can this be done?
You can combine groupby("ID") and filter:
df.groupby("ID").filter(lambda x: x.Timestamp.between(12, 155).all())
ID Timestamp x y
3 2 55 222 76
4 2 56 23 87
5 2 58 322 5436
Use transform with groupby, and use all() to check whether every item in the group matches the condition:
df[df.groupby('ID').Timestamp.transform(lambda x: x.between(12,155).all())]
ID Timestamp x y
3 2 55 222 76
4 2 56 23 87
5 2 58 322 5436
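Both answers can be tried on the question's data with the sketch below. It also shows a lambda-free form of the transform approach, which computes the per-row condition first and then asks whether it holds for the whole group. The names in_range, all_in_range, via_filter and via_transform are illustrative.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "Timestamp": [10, 12, 14, 55, 56, 58, 100, 150, 160, 170],
    "x": [322, 234, 22, 222, 23, 322, 322, 22, 12, 78],
    "y": [222, 542, 523, 76, 87, 5436, 345, 243, 765, 65],
})

# Approach 1: keep whole groups whose timestamps all fall in the range
via_filter = df.groupby("ID").filter(
    lambda g: g["Timestamp"].between(12, 155).all()
)

# Approach 2: row-wise condition, then "all" per ID broadcast back to rows
in_range = df["Timestamp"].between(12, 155)
all_in_range = in_range.groupby(df["ID"]).transform("all")
via_transform = df[all_in_range]

print(via_filter)
print(via_transform)
```

Note that between is inclusive on both ends by default, which matches the >= 12 and <= 155 condition in the question.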
I have a Dataframe containing data that looks like below.
p,g,a,s,v
15,196,1399,16,5
15,196,948,5,1
15,196,1894,5,1
15,196,1616,5,1
15,196,1742,3,1
15,196,1742,4,4
15,196,1742,5,1
15,195,732,9,2
15,195,1765,11,7
15,196,1815,9,1
15,196,1399,11,8
15,196,1958,0,1
15,195,767,9,1
15,195,1765,11,8
15,195,886,9,1
15,195,1765,11,9
15,196,1958,5,1
15,196,1697,1,1
15,196,1697,4,1
Given multiple entries that have the same p, g, a, and s, I need to drop all but the one with the highest v. The reason is that the original source of this data is a kind of event log, and each line corresponds to a "new total". If it matters, the source data is ordered by time and includes a timestamp index, which I removed for brevity. The entry with the latest date would be the same as the entry with the highest v, as v only increases.
Pulling an example out of the above data, given this:
p,g,a,s,v
15,195,1765,11,7
15,195,1765,11,8
15,195,1765,11,9
I need to drop the first two rows and keep the last one.
If I understand correctly, I think you want the following: group by the columns of interest, take the max of column 'v' in each group, and then call reset_index:
In [103]:
df.groupby(['p', 'g', 'a', 's'])['v'].max().reset_index()
Out[103]:
p g a s v
0 15 195 732 9 2
1 15 195 767 9 1
2 15 195 886 9 1
3 15 195 1765 11 9
4 15 196 948 5 1
5 15 196 1399 11 8
6 15 196 1399 16 5
7 15 196 1616 5 1
8 15 196 1697 1 1
9 15 196 1697 4 1
10 15 196 1742 3 1
11 15 196 1742 4 4
12 15 196 1742 5 1
13 15 196 1815 9 1
14 15 196 1894 5 1
15 15 196 1958 0 1
16 15 196 1958 5 1
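If the original rows need to be kept as they are (for example together with the timestamp index mentioned in the question), two alternatives are sketched below. These are not part of the answer above. The first relies on the question's statement that the data is time-ordered and v only increases; the second does not depend on row order. The names keys, last and top are illustrative.

```python
import io
import pandas as pd

# Data copied from the question
csv_text = """p,g,a,s,v
15,196,1399,16,5
15,196,948,5,1
15,196,1894,5,1
15,196,1616,5,1
15,196,1742,3,1
15,196,1742,4,4
15,196,1742,5,1
15,195,732,9,2
15,195,1765,11,7
15,196,1815,9,1
15,196,1399,11,8
15,196,1958,0,1
15,195,767,9,1
15,195,1765,11,8
15,195,886,9,1
15,195,1765,11,9
15,196,1958,5,1
15,196,1697,1,1
15,196,1697,4,1
"""
df = pd.read_csv(io.StringIO(csv_text))
keys = ["p", "g", "a", "s"]

# Option 1: assumes rows are in time order, so the last duplicate has the highest v
last = df.drop_duplicates(subset=keys, keep="last")

# Option 2: pick the row label of the maximum v in each group, independent of order
top = df.loc[df.groupby(keys)["v"].idxmax()]

print(last.sort_values(keys))
print(top)
```

Both options keep the original row labels, so any extra columns or a timestamp index on those rows would survive.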