Python group by only neighbours

I have a dataset consisting of measurements, and my dataframe looks like this:
ID VAL BS ERROR
0 0 0 0
1 1 0 1
2 1 0 1
3 0 0 0
4 11 10 1
5 10 10 0
6 12 10 2
7 11 10 1
8 9 10 -1
9 30 30 0
10 31 30 1
11 29 30 -1
12 10 10 0
13 9 10 -1
14 8 10 -2
15 11 10 1
16 0 0 0
17 1 0 1
18 2 0 2
19 9 10 -1
20 10 10 0
Here VAL is the measured value, BS is the base (VAL rounded to the nearest 10), and ERROR is the difference between the measured value and the base.
What I am trying to do is group by the 'BS' column, but only across neighbouring rows, so that each consecutive run of equal BS values forms its own group.
It is important to keep the order of the groups in this case.
The resulting dataset should look like this (I also want to calculate the aggregate min and max error for each group, but I guess that will not be a problem):
ID BS MIN MAX
0 0 0 1
1 10 -1 2
2 30 -1 1
3 10 -2 1
4 0 0 2
5 10 -1 0

You can find the consecutive groups like this:
df['GROUP'] = (df['BS']!=df['BS'].shift()).cumsum()
Then you group by the GROUP column and aggregate min and max:
df.groupby(['GROUP', 'BS'])['ERROR'].agg(['min', 'max']).reset_index()
The output should be:
GROUP BS min max
0 1 0 0 1
1 2 10 -1 2
2 3 30 -1 1
3 4 10 -2 1
4 5 0 0 2
5 6 10 -1 0
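The two steps above can be combined into one self-contained sketch. The frame is rebuilt from the question's table; the names run_id and result are illustrative, and named aggregation is assumed to be available (pandas 0.25 or later).

```python
import pandas as pd

# Data copied from the question's table.
df = pd.DataFrame({
    "VAL": [0, 1, 1, 0, 11, 10, 12, 11, 9, 30, 31, 29,
            10, 9, 8, 11, 0, 1, 2, 9, 10],
    "BS": [0, 0, 0, 0, 10, 10, 10, 10, 10, 30, 30, 30,
           10, 10, 10, 10, 0, 0, 0, 10, 10],
})
df["ERROR"] = df["VAL"] - df["BS"]

# A new run starts wherever BS differs from the previous row;
# the cumulative sum of those "change" flags labels each run.
run_id = df["BS"].ne(df["BS"].shift()).cumsum().rename("run")

# Aggregate per run; grouping by run_id keeps the original order.
result = (
    df.groupby(run_id)
      .agg(BS=("BS", "first"), MIN=("ERROR", "min"), MAX=("ERROR", "max"))
      .reset_index(drop=True)
)
print(result)
```

Grouping by the run label alone (rather than by the label and BS together) is enough here, because BS is constant within a run by construction.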

Related

Pandas add column on condition: If value of cell is True set value of largest number in Period to true

I have a pandas dataframe with lets say two columns, for example:
value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0
Now I want to add a third column (new_boolean) with the following criteria:
I specify a period, for this example period = 4.
Now I take a look at all rows where boolean == 1.
new_boolean will be 1 for the row holding the maximum value within the last period rows (counting the current row).
For example, boolean == 1 in the second row (index 1). Looking at the last period rows gives the values [1, 5]; 5 is the maximum, so new_boolean in that row will be 1.
Second example: the eighth row (index 7, value = 7). I get the values [7, 4, 12, 9]; 12 is the maximum, so new_boolean in the row with value 12 will be 1.
result:
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
How can I do this algorithmically?

Compute the rolling max of the 'value' column
>>> rolling_max_value = df.rolling(window=4, min_periods=1)['value'].max()
>>> rolling_max_value
0 1.0
1 5.0
2 5.0
3 5.0
4 9.0
5 12.0
6 12.0
7 12.0
8 12.0
9 8.0
10 17.0
11 17.0
12 17.0
Name: value, dtype: float64
Select only the relevant values, i.e. where 'boolean' = 1
>>> on_values = rolling_max_value[df.boolean == 1].unique()
>>> on_values
array([ 5., 9., 12., 17.])
The rows where 'new_boolean' = 1 are the ones where 'value' belongs to on_values
>>> df['new_boolean'] = df.value.isin(on_values).astype(int)
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
EDIT:
OP raised a good point
Does this also work if I have multiple columns with the same value and they have different booleans?
The previous solution doesn't account for that. To solve this, instead of computing the rolling max, we gather the row labels associated with the rolling max values, i.e. the rolling argmax or idxmax. To my knowledge, Rolling objects don't have an idxmax method, but we can easily compute it via apply.
def idxmax(values):
    return values.idxmax()
rolling_idxmax_value = (
    df.rolling(min_periods=1, window=4)['value']
    # raw=False passes each window as a Series, so idxmax() returns an index label
    .apply(idxmax, raw=False)
    .astype(int)
)
on_idx = rolling_idxmax_value[df.boolean == 1].unique()
df['new_boolean'] = 0
df.loc[on_idx, 'new_boolean'] = 1
Results:
>>> rolling_idxmax_value
0 0
1 1
2 1
3 1
4 4
5 5
6 5
7 5
8 5
9 8
10 10
11 10
12 10
Name: value, dtype: int64
>>> on_idx
[ 1 4 5 10]
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
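As a cross-check on the label-based approach above, the same rule can be written with plain positional indexing, which does not depend on the index being 0..n-1. This is a sketch rather than the answerer's code: the helper name mark_window_max and its period argument are hypothetical.

```python
import numpy as np
import pandas as pd

def mark_window_max(df, period=4):
    """Flag, for every row with boolean == 1, the position of the
    maximum 'value' within the last `period` rows (current row included)."""
    values = df["value"].to_numpy()
    flags = np.zeros(len(df), dtype=int)
    for pos in np.flatnonzero(df["boolean"].to_numpy() == 1):
        start = max(0, pos - period + 1)          # window start, clipped at 0
        window = values[start:pos + 1]
        flags[start + np.argmax(window)] = 1      # first maximum wins on ties
    return flags

df = pd.DataFrame({
    "value":   [1, 5, 0, 3, 9, 12, 4, 7, 8, 2, 17, 15, 6],
    "boolean": [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})
df["new_boolean"] = mark_window_max(df, period=4)
print(df)
```

The loop runs once per row with boolean == 1, so it stays cheap when such rows are sparse; because it works on positions, duplicate values in different rows are not confused with each other.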

I did this in two steps, which I think makes the solution clearer:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO('''
id value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0'''), sep=r'\s+', index_col=0)
df['new_bool'] = df['value'].rolling(min_periods=1, window=4).max()
df['new_bool'] = df.apply(lambda x: 1 if ((x['value'] == x['new_bool']) & (x['boolean'] == 1)) else 0, axis=1)
df
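Result (unlike the expected output in the question, rows 5 and 10 stay 0 here, because this marks a row only when it is its own rolling maximum and also has boolean == 1):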
value boolean new_bool
id
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 0
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 0
11 15 1 0
12 6 0 0

Pandas - changing rows where less than n subsequent values are equal

I have the following dataframe:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,0,1,1,1,1,0,0,0]})
Now I would like to set to zero every row that belongs to a run of fewer than four consecutive 1's, i.e. I would like to have the following resulting DataFrame:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0]})
I was not able to find a way to achieve this nicely...

Try with groupby and where:
streaks = df.groupby(df["col"].ne(df["col"].shift()).cumsum()).transform("sum")
output = df.where(streaks.ge(4), 0)
>>> output
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
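A variant of the same idea, as a sketch: measure each run's length with transform("size") instead of its sum, so the logic does not rely on the values being exactly 0 and 1. The names run_id, run_len and min_len are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col": [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0,
                           1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]})
min_len = 4  # shortest run of 1's that should survive

# Label consecutive runs of equal values.
run_id = df["col"].ne(df["col"].shift()).cumsum()

# Length of the run each row belongs to, broadcast back to the rows.
run_len = df.groupby(run_id)["col"].transform("size")

# Keep zeros as they are; keep 1's only inside sufficiently long runs.
df["cleaned"] = df["col"].where((df["col"] == 0) | (run_len >= min_len), 0)
print(df["cleaned"].tolist())
```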

We can do
df.loc[df.groupby(df.col.eq(0).cumsum()).transform('count')['col']<5,'col'] = 0
df
Out[77]:
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0

Subtract fixed row value in reference to column value in pandas dataframe

Within each group defined by one column, I would like to subtract the value of one fixed row from every row of that group.
My data looks like this:
TRACK TIME POSITION_X
0 1 0 12
1 1 30 13
2 1 60 15
3 1 90 11
4 2 0 10
5 2 20 11
6 2 60 13
7 2 90 17
I would like to subtract the POSITION_X value at TIME=0 from every POSITION_X value of the same TRACK, and create a new column ("NEW_POSX") with those values. The output should be like this:
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 3
3 1 90 11 -1
4 2 0 10 0
5 2 20 11 1
6 2 60 13 3
7 2 90 17 7
I have been using the following code to get this done:
import pandas as pd
data = {'TRACK': [1,1,1,1,2,2,2,2],
'TIME': [0,30,60,90,0,20,60,90],
'POSITION_X': [12,13,15,11,10,11,13,17],
}
df = pd.DataFrame (data, columns = ['TRACK','TIME','POSITION_X'])
df['NEW_POSX']= df.groupby('TRACK')['POSITION_X'].diff().fillna(0).astype(int)
df.head(8)
... but I don't get the desired output. Instead, I get a new column where the previous row is subtracted from each row (within each "TRACK"):
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 2
3 1 90 11 -4
4 2 0 10 0
5 2 20 11 1
6 2 60 13 2
7 2 90 17 4
Can anyone help me with this?

You can use transform with 'first' to get each track's first value (the value at TIME=0, assuming that row comes first within its TRACK), and then subtract it from the 'POSITION_X' column:
s=df.groupby('TRACK')['POSITION_X'].transform('first')
df['NEW_POSX']=df['POSITION_X']-s
#Same as:
#df['NEW_POSX']=df['POSITION_X'].sub(s)
Output:
df
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 3
3 1 90 11 -1
4 2 0 10 0
5 2 20 11 1
6 2 60 13 3
7 2 90 17 7
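If the TIME=0 row is not guaranteed to come first within each TRACK, a sketch that looks the baseline up explicitly may be safer. It assumes exactly one TIME == 0 row per TRACK; the name baseline is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "TRACK": [1, 1, 1, 1, 2, 2, 2, 2],
    "TIME": [0, 30, 60, 90, 0, 20, 60, 90],
    "POSITION_X": [12, 13, 15, 11, 10, 11, 13, 17],
})

# One baseline per track: POSITION_X where TIME == 0, indexed by TRACK.
baseline = df.loc[df["TIME"] == 0].set_index("TRACK")["POSITION_X"]

# Map each row's TRACK to its baseline and subtract.
df["NEW_POSX"] = df["POSITION_X"] - df["TRACK"].map(baseline)
print(df)
```

A track with no TIME == 0 row would yield NaN in NEW_POSX, which makes missing baselines easy to spot.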

How to downsample time series data in pandas?

I have a time series in pandas that looks like this (ordered by id):
id time value
1 0 2
1 1 4
1 2 5
1 3 10
1 4 15
1 5 16
1 6 18
1 7 20
2 15 3
2 16 5
2 17 8
2 18 10
4 6 5
4 7 6
I want to downsample the time from 1-minute to 3-minute intervals within each id group.
The value should be the maximum of each group (id and 3-minute interval).
The output should be like:
id time value
1 0 5
1 1 16
1 2 20
2 0 8
2 1 10
4 0 6
I tried a loop, but it takes a long time to process.
Any idea how to solve this for large dataframe?
Thanks!

You can convert your time column to an actual timedelta, then use resample for a vectorized solution (last() equals the maximum here only because the values increase within each id; use max() otherwise):
t = pd.to_timedelta(df.time, unit='min')
s = df.set_index(t).groupby('id').resample('3min').last().reset_index(drop=True)
s.assign(time=s.groupby('id').cumcount())
id time value
0 1 0 5
1 1 1 16
2 1 2 20
3 2 0 8
4 2 1 10
5 4 0 6
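For an integer minute column like the one in the question, the same binning can be sketched without timedeltas, using integer division and an explicit max(). It assumes bins are counted from each id's first time value; offset and out are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4],
    "time":  [0, 1, 2, 3, 4, 5, 6, 7, 15, 16, 17, 18, 6, 7],
    "value": [2, 4, 5, 10, 15, 16, 18, 20, 3, 5, 8, 10, 5, 6],
})

# Minutes elapsed since the first observation of each id.
offset = df["time"] - df.groupby("id")["time"].transform("min")

# Integer division by 3 gives the 3-minute bin number; take the max per bin.
out = (
    df.assign(time=offset // 3)
      .groupby(["id", "time"], as_index=False)["value"]
      .max()
)
print(out)
```

Because it takes an explicit maximum, this version does not depend on the values being sorted within each bin.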

Use np.r_ and .iloc with groupby:
df.groupby('id')['value'].apply(lambda x: x.iloc[np.r_[2:len(x):3,-1]])
Output:
id
1 2 5
5 16
7 20
2 10 8
11 10
4 13 6
Name: value, dtype: int64
Going a little further with column naming etc..
df_out = df.groupby('id')['value']\
.apply(lambda x: x.iloc[np.r_[2:len(x):3,-1]]).reset_index()
df_out.assign(time=df_out.groupby('id').cumcount()).drop('level_1', axis=1)
Output:
id value time
0 1 5 0
1 1 16 1
2 1 20 2
3 2 8 0
4 2 10 1
5 4 6 0

Pandas sum above all possible thresholds

I have a dataset with two risk model scores and observations that have a certain amount of value. Something like this:
import pandas as pd
df = pd.DataFrame(data={'segment':['A','A','A','A','A','A','A','B','B','B','B','B'],
'model1':[9,4,5,2,9,7,7,8,8,5,6,3],
'model2':[9,8,2,4,6,8,8,7,7,7,4,4],
'dollars':[15,10,-5,-7,6,7,-2,5,7,3,-1,-3]},
columns=['segment','model1','model2','dollars'])
print(df)
segment model1 model2 dollars
0 A 9 9 15
1 A 4 8 10
2 A 5 2 -5
3 A 2 4 -7
4 A 9 6 6
5 A 7 8 7
6 A 7 8 -2
7 B 8 7 5
8 B 8 7 7
9 B 5 7 3
10 B 6 4 -1
11 B 3 4 -3
My goal is to determine the simultaneous risk model thresholds where value is maximized, i.e. a cutoff like (model1 >= X) & (model2 >= Y). The risk-models are both rank-ordered such that higher numbers are lower risk and generally higher value.
I was able to get the desired output using a loop approach:
df_sum = df.groupby(by=['segment','model1','model2'])['dollars'].agg(['sum']).rename(columns={'sum':'dollar_sum'}).reset_index()
df_sum.loc[:,'threshold_sum'] = 0
#this loop works but runs very slowly on my large dataframe
#calculate the sum of dollars for each combination of possible model score thresholds
for row in df_sum.itertuples():
    #subset the original df down to just the observations above the given model scores
    df_temp = df[((df['model1'] >= getattr(row,'model1')) & (df['model2'] >= getattr(row,'model2')) & (df['segment'] == getattr(row,'segment')))]
    #calculate the sum and add it back to the dataframe
    df_sum.loc[row.Index,'threshold_sum'] = df_temp['dollars'].sum()
#show the max value for each segment
print(df_sum.loc[df_sum.groupby(by=['segment'])['threshold_sum'].idxmax()])
segment model1 model2 dollar_sum threshold_sum
1 A 4 8 10 30
7 B 5 7 3 15
The loop runs incredibly slowly as the size of the dataframe increases. I'm sure there's a faster way to do this (maybe using cumsum() or numpy), but I'm stumped on what it is. Does anyone have a better way to do it? Ideally any code would be easily extendable to n-many risk models and would output all possible combinations of threshold_sum in case I add other optimization criteria down the road.

You'll get some speedup with apply(), using your same approach, but I agree with your hunch, there's probably a faster way.
Here's an apply() solution:
With df_sum as:
df_sum = (df.groupby(['segment','model1','model2'])
.dollars
.sum()
.reset_index()
)
print(df_sum)
segment model1 model2 dollars
0 A 2 4 -7
1 A 4 8 10
2 A 5 2 -5
3 A 7 8 5
4 A 9 6 6
5 A 9 9 15
6 B 3 4 -3
7 B 5 7 3
8 B 6 4 -1
9 B 8 7 12
apply can be combined with groupby:
def get_threshold_sum(row):
    return (df.loc[(df.segment == row.segment) &
                   (df.model1 >= row.model1) &
                   (df.model2 >= row.model2),
                   ["segment","dollars"]]
            .groupby('segment')
            .sum()
            .dollars
            )
thresholds = df_sum.apply(get_threshold_sum, axis=1)
mask = thresholds.idxmax()
df_sum.loc[mask]
segment model1 model2 dollars
1 A 4 8 10
7 B 5 7 3
To see all possible thresholds, just print thresholds.

Finally found a non-loop approach; it requires some reshaping and cumsum().
import numpy as np
df['cumsum_dollars'] = df['dollars']
df2 = pd.pivot_table(df,index=['segment','model1','model2'],values=['dollars','cumsum_dollars'],fill_value=0,aggfunc=np.sum)
# descending sort ensures that the cumsum happens in the desired direction
df2 = df2.unstack(fill_value=0).sort_index(ascending=False,axis=0).sort_index(ascending=False,axis=1)
print(df2)
dollars cumsum_dollars
model2 9 8 7 6 4 2 9 8 7 6 4 2
segment model1
B 8 0 0 12 0 0 0 0 0 12 0 0 0
6 0 0 0 0 -1 0 0 0 0 0 -1 0
5 0 0 3 0 0 0 0 0 3 0 0 0
3 0 0 0 0 -3 0 0 0 0 0 -3 0
A 9 15 0 0 6 0 0 15 0 0 6 0 0
7 0 5 0 0 0 0 0 5 0 0 0 0
5 0 0 0 0 0 -5 0 0 0 0 0 -5
4 0 10 0 0 0 0 0 10 0 0 0 0
2 0 0 0 0 -7 0 0 0 0 0 -7 0
From here, take the cumulative sum in both the horizontal and vertical directions using the axis parameter of the cumsum() function.
df2['cumsum_dollars'] = df2['cumsum_dollars'].groupby(level='segment').cumsum(axis=0).cumsum(axis=1)
print(df2)
dollars cumsum_dollars
model2 9 8 7 6 4 2 9 8 7 6 4 2
segment model1
B 8 0 0 12 0 0 0 0 0 12 12 12 12
6 0 0 0 0 -1 0 0 0 12 12 11 11
5 0 0 3 0 0 0 0 0 15 15 14 14
3 0 0 0 0 -3 0 0 0 15 15 11 11
A 9 15 0 0 6 0 0 15 15 15 21 21 21
7 0 5 0 0 0 0 15 20 20 26 26 26
5 0 0 0 0 0 -5 15 20 20 26 26 21
4 0 10 0 0 0 0 15 30 30 36 36 31
2 0 0 0 0 -7 0 15 30 30 36 29 24
With the cumulative sums calculated, shape the dataframe back into how it originally looked and take the max of each group.
df3 = df2.stack().reset_index()
print(df3.loc[df3.groupby(by='segment')['cumsum_dollars'].idxmax()])
segment model1 model2 cumsum_dollars dollars
43 A 4 4 36 0
14 B 5 6 15 0
These thresholds, where there aren't any observations, are actually more valuable than any of the options that do have data. Note that idxmax() returns the first occurrence of the maximum, which is sufficient for my purposes. If you need to break ties, additional filtering/sorting would be required before calling idxmax().
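For comparison, here is a sketch that vectorizes the original loop with NumPy broadcasting. Like the loop in the question, it evaluates only the score combinations that actually occur in the data. The helper threshold_sums and its model_cols argument are hypothetical, and the boolean mask needs memory proportional to candidates × observations per segment, so treat it as an illustration under those assumptions rather than a tuned solution.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "segment": list("AAAAAAABBBBB"),
    "model1": [9, 4, 5, 2, 9, 7, 7, 8, 8, 5, 6, 3],
    "model2": [9, 8, 2, 4, 6, 8, 8, 7, 7, 7, 4, 4],
    "dollars": [15, 10, -5, -7, 6, 7, -2, 5, 7, 3, -1, -3],
})

def threshold_sums(group, model_cols=("model1", "model2")):
    """Sum dollars over observations at or above each observed threshold."""
    cand = group[list(model_cols)].drop_duplicates()
    # mask[i, j] is True when observation j meets candidate threshold i.
    mask = np.ones((len(cand), len(group)), dtype=bool)
    for col in model_cols:
        mask &= group[col].to_numpy()[None, :] >= cand[col].to_numpy()[:, None]
    sums = (mask * group["dollars"].to_numpy()).sum(axis=1)
    return cand.assign(threshold_sum=sums)

# Evaluate every segment separately, then pick the best threshold per segment.
parts = [threshold_sums(g).assign(segment=seg) for seg, g in df.groupby("segment")]
all_sums = pd.concat(parts, ignore_index=True)
best = all_sums.loc[all_sums.groupby("segment")["threshold_sum"].idxmax()]
print(best)
```

Adding a third model column only requires passing a longer model_cols tuple, and all_sums keeps every evaluated combination in case other optimization criteria are added later.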
