Selecting rows that match a condition of their group - Python

I have a Pandas DataFrame df which looks as follows:
ID Timestamp x y
1 10 322 222
1 12 234 542
1 14 22 523
2 55 222 76
2 56 23 87
2 58 322 5436
3 100 322 345
3 150 22 243
3 160 12 765
3 170 78 65
Now, I would like to keep all rows where the timestamp is between 12 and 155. On its own this could be done with df[(df["Timestamp"] >= 12) & (df["Timestamp"] <= 155)]. But I would like to keep only rows whose entire ID group has all its timestamps within the range. So for the example above the result should be the following dataframe:
ID Timestamp x y
2 55 222 76
2 56 23 87
2 58 322 5436
For ID == 1 and ID == 3, not all timestamps of the rows are in the range, which is why they are not included.
How can this be done?

You can combine groupby("ID") and filter:
df.groupby("ID").filter(lambda x: x.Timestamp.between(12, 155).all())
ID Timestamp x y
3 2 55 222 76
4 2 56 23 87
5 2 58 322 5436

Use transform with groupby and all() to check whether every item in the group matches the condition:
df[df.groupby('ID').Timestamp.transform(lambda x: x.between(12,155).all())]
ID Timestamp x y
3 2 55 222 76
4 2 56 23 87
5 2 58 322 5436
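If you prefer to avoid the Python-level lambda, an equivalent mask-based sketch (assuming the same df) computes the range check once and then tests it per group:
in_range = df['Timestamp'].between(12, 155)          # row-wise range check
df[in_range.groupby(df['ID']).transform('all')]      # keep only groups where every row passes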

Related

How to delete rows which have nan or an empty value in a SPECIFIC column?

I have a dataframe which has nan or empty cells in a specific column, for example the column at index 2. Unfortunately I don't have a subset of column names, I just have the index. I want to delete the rows which have this feature. On Stack Overflow there are many solutions, but they all use subset.
This is the dataframe for example:
12 125 36 45 665
15 212 12 65 62
65 9 nan 98 84
21 54 78 5 654
211 65 58 26 65
...
output:
12 125 36 45 665
15 212 12 65 62
21 54 78 5 654
211 65 58 26 65
If you need to test the third column (index = 2), use boolean indexing; this works whether nan is the missing value np.nan or the string 'nan':
idx = 2
df1 = df[df.iloc[:, idx].notna() & df.iloc[:, idx].ne('nan')]
#if values may be an empty string, the string 'nan', or a missing value NaN/None:
#df1 = df[df.iloc[:, idx].notna() & ~df.iloc[:, idx].isin(['nan',''])]
print (df1)
0 1 2 3 4
0 12 125 36.0 45 665
1 15 212 12.0 65 62
3 21 54 78.0 5 654
4 211 65 58.0 26 65
If the nans are actual missing values (np.nan):
df1 = df.dropna(subset=df.columns[[idx]])
print (df1)
0 1 2 3 4
0 12 125 36.0 45 665
1 15 212 12.0 65 62
3 21 54 78.0 5 654
4 211 65 58.0 26 65
Not sure what you mean by
"there are many solutions, but they all use subset"
but the way to do this would be:
df[~df.isna().any(axis=1)]
You can use notnull()
df = df.loc[df[df.columns[idx]].notnull()]
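If the column can contain real NaN, the string 'nan', and empty strings all at once, a combined sketch (assuming idx = 2 as above) is to normalize the placeholders to np.nan first and then drop:
import numpy as np

idx = 2
df1 = df.replace({'nan': np.nan, '': np.nan}).dropna(subset=[df.columns[idx]])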

How to group and sum rows by ID and subtract from group of rows with same ID? [python]

I have the following dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum
-----------------------------------------------
0 22 5 1 54 208
1 23 5 2 34 208
2 24 6 1 44 268
3 25 6 1 64 268
4 26 5 2 35 208
5 27 7 3 45 229
6 28 7 2 66 229
7 29 8 1 76 161
8 30 8 2 25 161
9 31 6 2 27 268
10 32 5 3 14 208
11 33 5 3 17 208
12 34 6 2 43 268
13 35 6 2 53 268
14 36 8 1 22 161
15 37 7 3 65 229
16 38 7 1 53 229
17 39 8 2 23 161
18 40 8 3 15 161
19 41 6 3 37 268
20 42 5 2 54 208
Each row contains a unique "ID_A", while different rows can have the same "ID_B" and "ID_C". Each row corresponds to its own "Value", where this "Value" can be the same between different rows. The "ID_B_Value_Sum" column contains the sums of all values from the "Value" column for all rows containing the same "ID_B". Calculating this sum is straightforward with python and pandas.
What I want to do is, for each row, take the "ID_B_Value_Sum" value but subtract the values of all other rows that share both the same "ID_B" and the same "ID_C", exclusive of the target row. For example, take "ID_B" = 6: the sum of all "Value" entries in this group is 268, as shown in the corresponding rows of the "ID_B_Value_Sum" column. Two of the rows in this group have "ID_C" = 1, three rows have "ID_C" = 2, and one row has "ID_C" = 3. Starting with row 2, which has "ID_C" = 1, this means taking its "ID_B_Value_Sum" value and subtracting the "Value" entries of all other rows with both "ID_B" = 6 and "ID_C" = 1, exclusive of the target row: for row 2 that gives 268 - 64 = 204. For another example, row 4 gives 208 - 34 - 54 = 120, and row 7 gives 161 - 22 = 139. These new values go in a new "Value_Sum_New" column for each row.
And so I want to produce the following output dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
---------------------------------------------------------------
0 22 5 1 54 208 XX
1 23 5 2 34 208 XX
2 24 6 1 44 268 204
3 25 6 1 64 268 XX
4 26 5 2 35 208 120
5 27 7 3 45 229 XX
6 28 7 2 66 229 XX
7 29 8 1 76 161 139
8 30 8 2 25 161 XX
9 31 6 2 27 268 XX
10 32 5 3 14 208 XX
11 33 5 3 17 208 XX
12 34 6 2 43 268 XX
13 35 6 2 53 268 XX
14 36 8 1 22 161 XX
15 37 7 3 65 229 XX
16 38 7 1 53 229 XX
17 39 8 2 23 161 XX
18 40 8 3 15 161 XX
19 41 6 3 37 268 XX
20 42 5 2 54 208 XX
What I am having trouble conceptualizing is how to, for each row, group together all rows with the same "ID_B", then sub-group the rows with the same "ID_C", and subtract that sub-group's sum from the target row's "ID_B_Value_Sum" while still keeping the target row's own "Value", to produce the final "Value_Sum_New". It feels like many nested actions and sub-actions, and I am unsure how to organize and order the workflow in a simple, streamlined way. How might I approach calculating this sum in python?
IIUC, you need:
df['Value_Sum_New'] = (df['ID_B_Value_Sum']
                       - df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
                       + df['Value'])
output:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
0 22 5 1 54 208 208
1 23 5 2 34 208 119
2 24 6 1 44 268 204
3 25 6 1 64 268 224
4 26 5 2 35 208 120
5 27 7 3 45 229 164
6 28 7 2 66 229 229
7 29 8 1 76 161 139
8 30 8 2 25 161 138
9 31 6 2 27 268 172
10 32 5 3 14 208 191
11 33 5 3 17 208 194
12 34 6 2 43 268 188
13 35 6 2 53 268 198
14 36 8 1 22 161 85
15 37 7 3 65 229 184
16 38 7 1 53 229 229
17 39 8 2 23 161 136
18 40 8 3 15 161 161
19 41 6 3 37 268 268
20 42 5 2 54 208 139
explanation
As you said, computing a sum per group is easy in pandas. You can actually compute ID_B_Value_Sum with:
df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')
Now we do the same for groups of ID_B + ID_C and subtract it from ID_B_Value_Sum; since we only want to exclude the other rows in the group, we add the row's own Value back.
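Written out step by step, the same computation reads (a sketch on the df above):
bc_sum = df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')  # total per (ID_B, ID_C) pair
other_rows = bc_sum - df['Value']                                # that total, excluding the row itself
df['Value_Sum_New'] = df['ID_B_Value_Sum'] - other_rows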

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum, per row, the values of only those columns whose names match a string (in this case, all columns ending in _CAP), and store the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me: it sums columns that have exactly the same name, so a simple groupby can produce the result, whereas I am trying to sum columns that only match a specific string.
Code to recreate above sample dataset:
import pandas as pd

data1 = [['1', 33, 25, 16, 50], ['2', 34, 22, 11, 66],
         ['3', 22, 12, 15, 42], ['4', 46, 45, 66, 54],
         ['5', 16, 6, 23, 75], ['6', 21, 42, 433, 50]]
df = pd.DataFrame(data1, columns=['ID', 'Length', 'Width', 'Range_CAP', 'Capacity_CAP'])
Let us use filter:
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other columns have CAP elsewhere in their name, anchor the match to the _CAP suffix with a regex:
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask where the values are True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]
    else:
        pass
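The same loop can be written more compactly (a sketch using the same '_CAP' substring test as the loop above):
cap_cols = [c for c in df.columns if '_CAP' in c]   # columns whose name contains '_CAP'
df['sum'] = df[cap_cols].sum(axis=1)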

pytest assert if a column is ascending or descending within another already sorted group

I am running the following code:
import numpy as np
import pandas as pd
dfTestExample = pd.DataFrame(np.random.randint(0,100,size=(1000, 4)), columns=list('ABCD'))
dfTestExample = dfTestExample.sort_values(["A", "B"], ascending = [True, False])
dfTestExample.head(10)
which produces
A B C D
303 0 84 13 96
728 0 43 48 32
558 0 35 49 49
286 0 34 17 4
652 0 29 53 4
292 0 18 62 29
139 0 17 63 99
718 1 91 6 48
611 1 83 19 75
208 1 80 35 73
dfTestExample.A.is_monotonic
True
dfTestExample.B.is_monotonic
False
How do I check if the column B is also monotonic for all values of A?
You can use groupby to split the dataframe into separate groups for each value of A:
monotonic = True
for group in dfTestExample.groupby('A'):
    b = group[1].B
    if not b.is_monotonic and not b.is_monotonic_decreasing:
        monotonic = False
print(monotonic)
groupby gives you a DataFrameGroupBy object. If you iterate over that object, you get (group key, DataFrame) tuples, and you can handle these separately. In your case the grouped DataFrame objects will look like:
A B C D
303 0 84 13 96
728 0 43 48 32
558 0 35 49 49
286 0 34 17 4
652 0 29 53 4
292 0 18 62 29
139 0 17 63 99
and:
A B C D
718 1 91 6 48
611 1 83 19 75
208 1 80 35 73
Note that if you want to know if a dataset is monotonically increasing or decreasing, you have to check both, as shown in the example.
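A more compact sketch of the same check, assuming dfTestExample from the question, uses groupby with apply:
per_group = dfTestExample.groupby('A')['B'].apply(
    lambda s: s.is_monotonic_increasing or s.is_monotonic_decreasing)
print(per_group.all())   # True only if B is monotonic within every A group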

CumSum column only if previous ids are between 2 values

Consider I have the below:
Dataframe:
id endId startId ownerId value
1 50 50 10 105
2 51 50 10 240
3 52 50 10 420
4 53 53 10 470
5 40 40 11 320
6 41 40 11 18
7 55 55 12 50
8 57 55 12 412
9 59 55 12 398
10 60 57 12 320
What I would like to do is, for each row, sum the "value" of all rows with the same ownerId whose endId is between the current row's startId and endId.
Output should be:
id endId startId ownerId value output
1 50 50 10 105 105 # Nothing between 50 and 50
2 51 50 10 240 345 # Found 1 record (endId with id 1)
3 52 50 10 420 765 # Found 2 records (endId with id 1 and 2)
4 53 53 10 470 470 # Nothing else between 53 and 53
5 40 40 11 320 320 # Reset because Owner is different
6 41 40 11 18 338 # Found 1 record (endId with id 5)
7 55 55 12 50 50 # ...
8 57 55 12 412 462
9 59 55 12 398 860
10 60 57 12 320 1130 # Found 3 records between 57 and 60 (endId with id 8, 9 and 10)
I tried to use diff, groupby.cumsum, etc. but I cannot get what I need...
I would use numpy broadcasting to identify the rows that you are looking for:
# Create a new df with ownerId as index
df2 = df.set_index('ownerId')
df2['output'] = 0
# Loop over the distinct ownerIds
for k in df2.index.unique():
    refend = df2.loc[k, 'endId'].values
    refstart = df2.loc[k, 'startId'].values
    # Identify values matching the condition
    i, j = np.where((refend[:, None] <= refend) & (refend[:, None] >= refstart))
    # Group by and sum
    dfres = pd.concat([df2.loc[k].iloc[j].endId.reset_index(drop=True),
                       df2.loc[k].iloc[i].value.reset_index(drop=True)],
                      axis=1).groupby('endId').sum()
    df2.loc[k, 'output'] = dfres.value.values
# Reset the index
df2.reset_index(inplace=True)
the output is:
ownerId id endId startId value output
0 10 1 50 50 105 105
1 10 2 51 50 240 345
2 10 3 52 50 420 765
3 10 4 53 53 470 470
4 11 5 40 40 320 320
5 11 6 41 40 18 338
6 12 7 55 55 50 50
7 12 8 57 55 412 462
8 12 9 59 55 398 860
9 12 10 60 57 320 1130
Edit
You can avoid the for-loop with the following:
refend = df.loc[:, 'endId'].values
refstart = df.loc[:, 'startId'].values
i, j = np.where((refend[:, None] <= refend) & (refend[:, None] >= refstart))
dfres = pd.concat([df.iloc[j].endId.reset_index(drop=True),
                   df.loc[:, ['ownerId', 'value']].iloc[i].reset_index(drop=True)],
                  axis=1).groupby(['ownerId', 'endId']).sum()
df['output'] = dfres.value.values
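For comparison, a more readable but O(n²) row-wise sketch (assuming the original df) reproduces the expected output directly:
df['output'] = df.apply(
    lambda r: df.loc[(df['ownerId'] == r['ownerId'])
                     & df['endId'].between(r['startId'], r['endId']), 'value'].sum(),
    axis=1)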
I made a copy of df to df2, to keep the original data.
I suggest you break the task into two steps:
# change everything
df2['output'] = df.groupby('ownerId')['value'].cumsum()
# check and update where it applies
df2['output'] = np.where((df2['endId'] <= df['startId']),
                         df2['value'],    # copy value from
                         df2['output'])   # place value into
print(df2)
id endId startId ownerId value output
0 1 50 50 10 105 105
1 2 51 50 10 240 345
2 3 52 50 10 420 765
3 4 53 53 10 470 470
4 5 40 40 11 320 320
5 6 41 40 11 18 338
6 7 55 55 12 50 50
7 8 57 55 12 412 462
8 9 59 55 12 398 860
9 10 60 57 12 320 1180
I am sorry people, but I still don't get it. For ownerId 10 and 11, the record where endId and startId share the same value is counted in the cumulative sum, and that seems to be ok. But for some reason you are saying that the same rule doesn't apply to ownerId 12. I understand that ids 7 to 10 should be considered. The pattern seems to be not to count the values when endId and startId match on the highest value, as happens for id 4.
