Conditional Rolling count - python

Here is my dataframe:
score
1
62
7
15
167
73
25
24
2
76
I want to compare a score with the previous 4 scores and count the number of scores higher than the current one.
This is my expected output:
score count
1
62
7
15
167 0
73 1 (we take the 4 previous scores: 167, 15, 7, 62; only 167 > 73, so the count is 1)
25 2
24 3
2 4
76 0
If somebody has an idea of how to do that, it would be welcome.
Thanks!

I do not think your output matches your question. However, if you do look only at the previous 4 elements, then you could implement the following:
scores = [1, 62, 7, 15, 167, 73, 25, 24, 2, 76]
highers = []
for index, score in enumerate(scores[4:]):
    higher = len([s for s in scores[index:index+4] if score < s])
    print(higher)
    highers.append(higher)
print(highers)
# [0, 1, 2, 3, 4, 0]
Then, you could just add this highers list as a pandas column:
df['output'] = [0]*4 + highers
Note that I pad the output here by assigning zeros to the first four values.
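For completeness, here is a pandas-only sketch of the same idea (my own addition, assuming the column is named 'score'): a rolling window of 5 covers the current value plus its 4 predecessors, and we count how many of the first 4 entries exceed the last one.
import pandas as pd
df = pd.DataFrame({"score": [1, 62, 7, 15, 167, 73, 25, 24, 2, 76]})
# Count, within each window of 5, how many of the first 4 values
# (the previous scores) are greater than the last (current) one.
df["count"] = (
    df["score"]
    .rolling(5)
    .apply(lambda w: (w[:-1] > w[-1]).sum(), raw=True)
    .fillna(0)
    .astype(int)
)
print(df["count"].tolist())
# [0, 0, 0, 0, 0, 1, 2, 3, 4, 0]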

Related

Comparison between rows based on custom function

I have a large data frame such as the one below:
Vehicle longitude latitude trip
0 33 155 0
0 34 156 1
1 32 154 2
1 37 154 5
2 25 145 2
. . . .
. . . .
I also defined a custom boolean function to check whether a coordinate is inside a specific area.
def check(main_vehicle_latitude, main_vehicle_longitude, radius,
          compare_vehicle_latitude, compare_vehicle_longitude):
    if condition:  # e.g. the compared coordinates fall within `radius`
        x = True
    return x
Now I want to apply this function to each row of my data frame so that each vehicle/trip is compared with the coordinates of all other vehicles, finding all the vehicles that have a similar trip location. The final output would be, for each vehicle, a list of all the other vehicles with a similar trip location. For example, the coordinates of vehicle 0, trip 0 should be compared with all other vehicles to find the vehicles whose start coordinates are similar to those of the first vehicle (trip 0), and the check continues like this for all vehicle trips. It is a bit complicated to explain, but hopefully this is clear enough. I'm looking for a very efficient way to do this since the data frame is large, but unfortunately I'm a beginner, so any help would be greatly appreciated.
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "vehicle": [0, 0, 1, 1, 2, 3, 4, 5],
        "longitude": [33, 34, 32, 37, 25, 33, 37, 33],
        "latitude": [155, 156, 154, 154, 145, 155, 154, 155],
        "trip": [0, 1, 2, 5, 2, 1, 0, 6],
    }
)
print(df)
# Output
vehicle longitude latitude trip
0 0 33 155 0
1 0 34 156 1
2 1 32 154 2
3 1 37 154 5
4 2 25 145 2
5 3 33 155 1
6 4 37 154 0
7 5 33 155 6
Here is one way to do it with Pandas groupby, explode, and concat:
# Find matches
tmp = df.groupby(["longitude", "latitude"]).agg(list).reset_index(drop=True)
tmp["match"] = tmp.apply(lambda x: 1 if len(x["vehicle"]) > 1 else pd.NA, axis=1)
tmp = tmp.dropna()
# Format results
tmp["match"] = tmp.apply(
lambda x: [[v, t] for v, t in zip(x["vehicle"], x["trip"])], axis=1
)
tmp = tmp.explode("vehicle")
tmp = tmp.explode("trip")
tmp["match"] = tmp.apply(
lambda x: x["match"] if [x["vehicle"], x["trip"]] in x["match"] else pd.NA, axis=1
)
tmp = tmp.dropna()
tmp["match"] = tmp.apply(
lambda x: [p[0] for p in x["match"] if p[0] != x["vehicle"]], axis=1
)
# Add results back to initial dataframe
df = pd.concat(
[df.set_index(["vehicle", "trip"]), tmp.set_index(["vehicle", "trip"])], axis=1
)
Then:
print(df)
# Output
longitude latitude match
vehicle trip
0 0 33 155 [3, 5]
1 34 156 NaN
1 2 32 154 NaN
5 37 154 [4]
2 2 25 145 NaN
3 1 33 155 [0, 5]
4 0 37 154 [1]
5 6 33 155 [0, 3]
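An alternative sketch (my own addition, not part of the answer above): a self-merge on the coordinate columns matches rows that share the exact same (longitude, latitude) pair, without the explode/apply passes. Starting again from the toy dataframe defined above:
pairs = df.merge(df, on=["longitude", "latitude"], suffixes=("", "_other"))
pairs = pairs[pairs["vehicle"] != pairs["vehicle_other"]]  # drop self-matches
matches = (
    pairs.groupby(["vehicle", "trip"])["vehicle_other"]
    .agg(list)
    .rename("match")
)
print(df.set_index(["vehicle", "trip"]).join(matches))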

How to select the rows with same absolute value in a column

I want to select rows 0, 1, 3, and 4, plus any other rows whose values share the same absolute value. Note that we assume we don't know the values in advance (there could be -25, 25, -2356, 2356, etc.).
test = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'quantity': [20, 30, 40, -30, -20]})
id quantity
0 1 20
1 2 30
2 3 40
3 4 -30
4 5 -20
.....
What is the best way of doing this?
IIUC, you want to keep the rows whose absolute value occurs at least 2 times. You could use groupby on the absolute value:
test[test.groupby(test['quantity'].abs())['quantity'].transform('size').ge(2)]
If you want to ensure that you have both the negative and positive value, make it a set and check that there are 2 elements (the positive and negative):
test[test.groupby(test['quantity'].abs())['quantity'].transform(lambda g: len(set(g))==2)]
output:
id quantity
0 1 20
1 2 30
3 4 -30
4 5 -20
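Another compact sketch (my own addition): keep the rows whose negated value also appears in the column, which directly encodes that both signs are present:
test[test['quantity'].isin(-test['quantity'])]  # note: a 0 would always match itself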

How do I put the output in an array?

How do I put the output generated below into an array? I then want to subtract from each element its preceding element, for example 6-0, 7-6, 10-7, etc., to get the running lengths.
for index, item in enumerate(binary_sequence):
    if item == 1:
        print(index)
Output:
6
7
10
11
15
16
19
30
35
44
48
49
51
54
55
56
57
60
74
76
78
80
85
90
97
98
Python has ways to avoid explicit indexing for common manipulations.
zip lets you pair lists and iterate over the paired values; pairing a list with an offset version of itself is a common idiom.
seq = [*range(7, 30, 4)]
seq
Out[28]: [7, 11, 15, 19, 23, 27]
out = []
for a, b in zip(seq, [0] + seq):  # '[0] + seq' puts a '0' in front, shifting the rest over
    out.append(a - b)
print(out)
[7, 4, 4, 4, 4, 4]
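If numpy is available, the running lengths can be computed in one step (a sketch, my addition, assuming the indices have first been collected into a list):
import numpy as np
# indices = [i for i, item in enumerate(binary_sequence) if item == 1]
indices = [6, 7, 10, 11, 15]  # first few values from the output above
runs = np.diff(indices, prepend=0)  # 6-0, 7-6, 10-7, ...
print(runs)  # [6 1 3 1 4]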

merge rows pandas dataframe based on condition

Hi, I have a dataframe df containing a set of events (rows).
df = pd.DataFrame(data=[[1, 2, 7, 10],
                        [10, 22, 1, 30],
                        [30, 42, 2, 10],
                        [100, 142, 22, 1],
                        [143, 152, 2, 10],
                        [160, 162, 12, 11]],
                  columns=['Start', 'End', 'Value1', 'Value2'])
df
Out[15]:
Start End Value1 Value2
0 1 2 7 10
1 10 22 1 30
2 30 42 2 10
3 100 142 22 1
4 143 152 2 10
5 160 162 12 11
If 2 (or more) consecutive events are <= 10 apart, I would like to merge them (i.e. use the start of the first event, the end of the last, and sum the values in Value1 and Value2).
In the example above df becomes:
df
Out[15]:
Start End Value1 Value2
0 1 42 10 50
1 100 162 36 22
That's totally possible:
df.groupby(((df.Start - df.End.shift(1)) > 10).cumsum()).agg({'Start':min, 'End':max, 'Value1':sum, 'Value2': sum})
Explanation:
start_end_differences = df.Start - df.End.shift(1)  # shift moves the series down one row
threshold_selector = start_end_differences > 10  # boolean series, True where the difference is more than 10
groups = threshold_selector.cumsum()  # sums the Trues (1s), creating integer group labels starting at 0
df.groupby(groups).agg({'Start': min})  # the aggregation is self-explanatory
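For the sample dataframe above, a quick check of these intermediate values (my addition, using the names just defined):
print(start_end_differences.tolist())  # [nan, 8.0, 8.0, 58.0, 1.0, 8.0]
print(groups.tolist())  # [0, 0, 0, 1, 1, 1]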
Here is a generalized solution that remains agnostic of the other columns:
cols = df.columns.difference(['Start', 'End'])
grps = df.Start.sub(df.End.shift()).gt(10).cumsum()
gpby = df.groupby(grps)
gpby.agg(dict(Start='min', End='max')).join(gpby[cols].sum())
Start End Value1 Value2
0 1 42 10 50
1 100 162 36 22

How do I perform a summation of `n` rows at a time in pandas? [duplicate]

This question already has answers here:
Calculate average of every x rows in a table and create new table
(6 answers)
Closed 5 years ago.
Given a data frame
A
0 14
1 59
2 38
3 40
4 99
5 89
6 70
7 64
8 84
9 40
10 30
11 94
12 65
13 29
14 48
15 26
16 80
17 79
18 74
19 69
This data frame has 20 rows. I would like to group n=5 rows at a time and sum them up. So, my output would look like this:
A
0 250
1 347
2 266
3 328
df.rolling(5).sum() will not help because it does not allow you to vary the stride when summing.
What other ways are there to do this?
df.set_index(df.index // 5).sum(level=0)
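A side note (my addition): Series.sum(level=...) has been deprecated in recent pandas releases, so the same idea is now spelled with an explicit groupby:
df.set_index(df.index // 5).groupby(level=0).sum()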
If you can manage an ndarray with the sums as opposed to a Series (you could always construct a Series again anyhow), you could use np.add.reduceat.
np.add.reduceat(df.A.values, np.arange(0, df.A.size, 5))
Which in this case returns
array([250, 347, 266, 328])
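As noted, you can always rebuild a pandas object from that array; a minimal sketch:
import numpy as np
import pandas as pd
sums = np.add.reduceat(df.A.values, np.arange(0, df.A.size, 5))
result = pd.Series(sums, name='A')  # back to a pandas Series
print(result.tolist())  # [250, 347, 266, 328]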
Assuming your indices are contiguous, you can perform integer division on df.index, and then group by index.
For the df above, you can do this:
df.index // 5
# Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], dtype='int64')
Getting the final answer is just one more step, using df.groupby and DataFrameGroupBy.sum:
df.groupby(df.index // 5).sum()
A
0 250
1 347
2 266
3 328
If you do not have a RangeIndex, use df.reset_index first and then group.
