I stumbled upon a very peculiar problem in Pandas. I have this dataframe:
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,2.0,3
1,1600349033921620000,1,18.5371406,-14.224917,0,-0.0113912,1.443597,20,0.5,0.9,-1,7,2.0,3
2,1600349033921650000,2,19.808648100000006,-6.778450599999998,0,0.037289,-1.0557937,20,0.5,0.9,-1,7,2.0,3
3,1600349033921670000,3,22.1796988,-5.7078115999999985,0,0.2585675,-1.2431861000000002,20,0.5,0.9,-1,7,2.0,3
4,1600349033921670000,4,20.757325,-16.115366,0,-0.2528627,0.7889673,20,0.5,0.9,-1,7,2.0,3
5,1600349033921690000,5,20.9491012,-17.7806833,0,0.5062633,0.9386511,20,0.5,0.9,-1,7,2.0,3
6,1600349033921690000,6,20.6225258,-5.5344404,0,-0.1192678,-0.7889041,20,0.5,0.9,-1,7,2.0,3
7,1600349033921700000,7,21.8077004,-14.736984,0,-0.0295737,1.3084618,20,0.5,0.9,-1,7,2.0,3
8,1600349033954560000,0,23.206789800000006,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,2.0,3
9,1600349033954570000,1,18.555421300000006,-13.7440508,0,0.0548418,1.4426004,20,0.5,0.9,-1,7,2.0,3
10,1600349033954570000,2,19.8409748,-7.126075500000002,0,0.0969802,-1.0428747,20,0.5,0.9,-1,7,2.0,3
11,1600349033954580000,3,22.3263185,-5.9586202,0,0.4398591,-0.752425,20,0.5,0.9,-1,7,2.0,3
12,1600349033954590000,4,20.7154136,-15.842398800000002,0,-0.12573430000000002,0.8189016,20,0.5,0.9,-1,7,2.0,3
13,1600349033954590000,5,21.038901,-17.4111883,0,0.2693992,1.108485,20,0.5,0.9,-1,7,2.0,3
14,1600349033954600000,6,20.612499,-5.810969,0,-0.030080400000000007,-0.8295869,20,0.5,0.9,-1,7,2.0,3
15,1600349033954600000,7,21.7872537,-14.3011986,0,-0.0613401,1.3073578,20,0.5,0.9,-1,7,2.0,3
16,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
17,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
18,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
This is the input file.
Please note that id always runs from 0 up to 7 and then repeats, and the time column increases sequentially (i.e. each row's time should be greater than or equal to the previous one).
I would like to reorder the rows of the dataframe as shown below:
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.0,2
1,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.0,2
2,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.0,2
3,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,1
4,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,1
5,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,1
6,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
7,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
8,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
9,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,3
10,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,3
11,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,3
This is the desired result.
Please note that I need to reorder the dataframe rows based on these columns: id, time, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP, SIM.
As you can see from the desired result, the time column must run from smallest to largest, and the same holds for the remaining columns: id, SIM, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP.
I tried sorting by several columns without success. Moreover, I tried to use groupby, but failed.
Could you help me solve this problem? Any suggestions are welcome.
P.S.
I have pasted the dataframe as CSV so it can be read easily with the clipboard function (pd.read_clipboard), in order to be easily reproducible.
I am attaching a picture as well.
What did you try when you sorted by several columns?
In [10]: df.sort_values(['id', 'time', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM'])
Out[10]:
   Unnamed: 0                 time  id        X        Y  theta      Vx      Vy  ANGLE_FR  DANGER_RAD  RISK_RAD  TTC_DAN_LOW  TTC_DAN_UP  TTC_STOP  SIM
0           0  1600349033921610000   0  23.2644  -7.1409      0  0.0210 -1.1414        20         0.5       0.9           -1           7         2    3
8           8  1600349033954560000   0  23.2068  -7.5171      0 -0.1728 -1.1285        20         0.5       0.9           -1           7         2    3
1           1  1600349033921620000   1  18.5371 -14.2249      0 -0.0114  1.4436        20         0.5       0.9           -1           7         2    3
9           9  1600349033954570000   1  18.5554 -13.7441      0  0.0548  1.4426        20         0.5       0.9           -1           7         2    3
2           2  1600349033921650000   2  19.8086  -6.7785      0  0.0373 -1.0558        20         0.5       0.9           -1           7         2    3
How about this:
groupby_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM']
# groups come out in sorted key order; sort by id and time within each group
df = df.groupby(groupby_cols, group_keys=False).apply(lambda g: g.sort_values(['id', 'time'])).reset_index(drop=True)
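For completeness, here is a minimal end-to-end sketch of the plain-sort alternative (assuming the CSV pasted above is on the clipboard, as the P.S. suggests; param_cols is just an illustrative name). Sorting by id first, then the parameter columns, then time reproduces the ordering shown in the desired result:

import pandas as pd

# read the frame pasted in the question; the first unnamed column is the old index
df = pd.read_clipboard(sep=',', index_col=0)

# parameter columns that identify one simulation run
param_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD',
              'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM']

# id first, then the run parameters, then time ascending within each run
df = df.sort_values(['id'] + param_cols + ['time']).reset_index(drop=True)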
I want to compute the expanding window of just the last few elements in a group...
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [np.nan, np.nan, 1, 1, 2, 2, 1, 1], 'A': [1, 2, 1, 2, 1, 2, 1, 2]})
df.groupby("A")["B"].expanding().quantile(0.5)
This gives:
A
1  0    NaN
   2    1.0
   4    1.5
   6    1.0
2  1    NaN
   3    1.0
   5    1.5
   7    1.0
Name: B, dtype: float64
I only really want the last two rows for each group, though. The result should be:
A
1  4    1.5
   6    1.0
2  5    1.5
   7    1.0
I can easily calculate it all and then just take the sections I want, but this is very slow if my dataframe is thousands of elements long, and I don't want to roll across the whole window... just the last two "rolls".
EDIT: I have amended the title; a lot of people are correctly answering part of the question, but ignoring what is, IMO, the important part (I should have been clearer).
The issue here is the time it takes. I could just "tail" the answer to get the last two rows, but then it involves calculating the first two "expanding windows" and throwing those results away. If my dataframe were instead thousands of rows long and I just needed the answer for the last few entries, much of this calculation would be wasted time. This is the main problem I have.
As I stated:
"I can easily calculate it all and then just get the sections I want" => through using tail.
Sorry for the confusion.
Also, potentially using tail doesn't involve calculating the lot, but it still seems like it does from the timings I have done... maybe this is not correct; it is an assumption I have made.
EDIT2: The other option I have tried was using min_periods in rolling to force it not to calculate the initial sections of the group, but this has many pitfalls: it doesn't work if the array includes NaNs, and it breaks if the groups are not the same length.
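To make that concrete, here is a sketch of that attempt on the example frame (using expanding with min_periods for simplicity; since each group here has 4 rows, min_periods=3 should leave only the last two windows non-NaN):

# the leading NaN in each group reduces the observation count, so the
# second-to-last window (where we wanted 1.5) also comes back NaN
df.groupby('A')['B'].expanding(min_periods=3).quantile(0.5)
# A
# 1  0    NaN
#    2    NaN
#    4    NaN
#    6    1.0
# 2  1    NaN
#    3    NaN
#    5    NaN
#    7    1.0
# Name: B, dtype: float64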
EDIT3:
As a simpler problem and piece of reasoning: it's a limitation of the expanding/rolling window, I think... say we had an array [1,2,3,4,5]; the expanding windows are [1], [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5], and if we run the max over that we get 1,2,3,4,5 (the max of each array). But if I just want the max of the last two expanding windows, I just need max[1,2,3,4] = 4 and max[1,2,3,4,5] = 5. Intuitively, I don't need to calculate the max of the first three expanding-window results to get the last two. But Pandas' implementation might be that it calculates max[1,2,3,4] as max[max[1,2,3], max[4]] = 4, in which case the calculation of the entire window is necessary... this might be the same for the quantile example. There might be an alternative way to do this without using expanding, however... not sure... this is what I can't work out.
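To illustrate what such an alternative could look like: a minimal sketch that evaluates the quantile directly on only the k largest prefixes of each group instead of on every expanding step (last_k_expanding_quantile is a hypothetical helper, not a pandas API; it still scans the data, but computes only k quantiles instead of one per row):

import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [np.nan, np.nan, 1, 1, 2, 2, 1, 1],
                   'A': [1, 2, 1, 2, 1, 2, 1, 2]})

def last_k_expanding_quantile(s, q, k=2):
    # quantile of each of the k largest prefixes of the group,
    # labelled with the original index of each prefix's last row
    positions = range(len(s) - k, len(s))
    return pd.Series([s.iloc[:p + 1].quantile(q) for p in positions],
                     index=s.index[len(s) - k:])

print(df.groupby('A')['B'].apply(last_k_expanding_quantile, q=0.5))
# A
# 1  4    1.5
#    6    1.0
# 2  5    1.5
#    7    1.0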
Maybe try using tail: https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.core.groupby.GroupBy.tail.html
df.groupby('A')['B'].rolling(4, min_periods=1).quantile(0.5).reset_index(level=0).groupby('A').tail(2)
Out[410]:
   A    B
4  1  1.5
6  1  1.0
5  2  1.5
7  2  1.0
rolling and expanding are similar; here rolling(4, min_periods=1) matches expanding because each group has at most 4 rows.
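For reference, the brute-force expanding version trimmed with tail (the route the OP wants to avoid on large frames, since it computes every step first) can be written as:

# compute all expanding quantiles, then keep the last two per group
df.groupby('A')['B'].expanding().quantile(0.5).groupby(level=0).tail(2)
# A
# 1  4    1.5
#    6    1.0
# 2  5    1.5
#    7    1.0
# Name: B, dtype: float64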
How about this (edited 06/12/2018):
def last_two_quantile(row, q):
    return pd.Series([row.iloc[:-1].quantile(q), row.quantile(q)])
df.groupby('A')['B'].apply(last_two_quantile, 0.5)
Out[126]:
A
1  0    1.5
   1    1.0
2  0    1.5
   1    1.0
Name: B, dtype: float64
If this (or something like it) doesn't do what you desire, I think you should provide a real example of your use case.
Is this what you want?
df[-4:].groupby("A")["B"].expanding().quantile(0.5)
A
1  4    2.0
   6    1.5
2  5    2.0
   7    1.5
Name: B, dtype: float64
Hope this helps.
Solution1:
newdf = df.groupby("A")["B"].expanding().quantile(0.5).reset_index()
for i in range(newdf["A"].max() + 1):
    print(newdf[newdf["A"] == i][-2:], '\n')
Solution2:
newdf2 = df.groupby("A")["B"].expanding().quantile(0.5)
for i in range(newdf2.index.get_level_values("A").max() + 1):
    print(newdf[newdf["A"] == i][-2:], '\n')  # note: prints from the newdf built in Solution1
Solution3:
for i in range(df.groupby("A")["B"].expanding().quantile(0.5).index.get_level_values("A").max() + 1):
    print(newdf[newdf["A"] == i][-2:], '\n')  # likewise reuses newdf from Solution1
output:
Empty DataFrame
Columns: [A, level_1, B]
Index: []

   A  level_1    B
2  1        4  1.5
3  1        6  1.0

   A  level_1    B
6  2        5  1.5
7  2        7  1.0
New solution:
newdf = pd.DataFrame(columns=["A", "B"])
# collapse all but the last two rows of each group into a single summed row
for i in range(len(df["A"].unique())):
    newdf = pd.concat([newdf, pd.DataFrame(df[df["A"] == i + 1][:-2].sum()).T])
newdf["A"] = newdf["A"] / 2  # each summed row added A twice, so halve it to recover the group label
# append the last two rows of each group unchanged
for i in range(len(df["A"].unique())):
    newdf = pd.concat([newdf, df[df["A"] == df["A"].unique()[i]][-2:]])
#newdf = newdf.reset_index(drop=True)
newdf["A"] = newdf["A"].astype(int)
for i in range(newdf["A"].max() + 1):
    print(newdf[newdf["A"] == i].groupby("A")["B"].expanding().quantile(0.5)[-2:])
output:
Series([], Name: B, dtype: float64)
A
1  4    1.5
   6    1.0
Name: B, dtype: float64
A
2  5    1.5
   7    1.0
Name: B, dtype: float64