Pandas reorder rows of dataframe - python

I stumble upon very peculiar problem in Pandas. I have this dataframe
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,2.0,3
1,1600349033921620000,1,18.5371406,-14.224917,0,-0.0113912,1.443597,20,0.5,0.9,-1,7,2.0,3
2,1600349033921650000,2,19.808648100000006,-6.778450599999998,0,0.037289,-1.0557937,20,0.5,0.9,-1,7,2.0,3
3,1600349033921670000,3,22.1796988,-5.7078115999999985,0,0.2585675,-1.2431861000000002,20,0.5,0.9,-1,7,2.0,3
4,1600349033921670000,4,20.757325,-16.115366,0,-0.2528627,0.7889673,20,0.5,0.9,-1,7,2.0,3
5,1600349033921690000,5,20.9491012,-17.7806833,0,0.5062633,0.9386511,20,0.5,0.9,-1,7,2.0,3
6,1600349033921690000,6,20.6225258,-5.5344404,0,-0.1192678,-0.7889041,20,0.5,0.9,-1,7,2.0,3
7,1600349033921700000,7,21.8077004,-14.736984,0,-0.0295737,1.3084618,20,0.5,0.9,-1,7,2.0,3
8,1600349033954560000,0,23.206789800000006,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,2.0,3
9,1600349033954570000,1,18.555421300000006,-13.7440508,0,0.0548418,1.4426004,20,0.5,0.9,-1,7,2.0,3
10,1600349033954570000,2,19.8409748,-7.126075500000002,0,0.0969802,-1.0428747,20,0.5,0.9,-1,7,2.0,3
11,1600349033954580000,3,22.3263185,-5.9586202,0,0.4398591,-0.752425,20,0.5,0.9,-1,7,2.0,3
12,1600349033954590000,4,20.7154136,-15.842398800000002,0,-0.12573430000000002,0.8189016,20,0.5,0.9,-1,7,2.0,3
13,1600349033954590000,5,21.038901,-17.4111883,0,0.2693992,1.108485,20,0.5,0.9,-1,7,2.0,3
14,1600349033954600000,6,20.612499,-5.810969,0,-0.030080400000000007,-0.8295869,20,0.5,0.9,-1,7,2.0,3
15,1600349033954600000,7,21.7872537,-14.3011986,0,-0.0613401,1.3073578,20,0.5,0.9,-1,7,2.0,3
16,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
17,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
18,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
This is input file
Please note that Id always starts at 0 up to 7 and repeat and time column is in sequential step (which implies that previous row should be smaller or equal to current one).
I would like to reorder rows of the dataframe as it is below.
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.0,2
1,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.0,2
2,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.0,2
3,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,1
4,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,1
5,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,1
6,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
7,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
8,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
9,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,3
10,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,3
11,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,3
This is the desired result
Please note that I need to reorder dataframe rows based on this columns id, time, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP, SIM.
As you see from the desired result we need to reoder dataframe in that way time column from smallest to largest one this holds true for the rest of columns, id, sim, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP.
I tried to sort by several columns without success. Moreover, I tried to use groupby but I failed.
Would you like to help to solve the problem? Any suggestions are welcome.
P.S.
I have paste dataframe so they can be read easily with clipboard function in order to be easily reproducible.
I am attaching pic as well.

What did you try to sort by several columns?
In [10]: df.sort_values(['id', 'time', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM'])
Out[10]:
Unnamed: 0 time id X Y theta Vx Vy ANGLE_FR DANGER_RAD RISK_RAD TTC_DAN_LOW TTC_DAN_UP TTC_STOP SIM
0 0 1600349033921610000 0 23.2644 -7.1409 0 0.0210 -1.1414 20 0.5 0.9 -1 7 2 3
8 8 1600349033954560000 0 23.2068 -7.5171 0 -0.1728 -1.1285 20 0.5 0.9 -1 7 2 3
1 1 1600349033921620000 1 18.5371 -14.2249 0 -0.0114 1.4436 20 0.5 0.9 -1 7 2 3
9 9 1600349033954570000 1 18.5554 -13.7441 0 0.0548 1.4426 20 0.5 0.9 -1 7 2 3
2 2 1600349033921650000 2 19.8086 -6.7785 0 0.0373 -1.0558 20 0.5 0.9 -1 7 2 3

How about this:
groupby_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP, SIM']
df = df.groupby(groupby_cols).reset_index()

Related

Find index of first bigger row of current value in pandas dataframe

I have big dataset of values as follow:
column "bigger" would be index of the first row with bigger "bsl" than "mb" from current row. I need to do it without loop as I need it to be done in less than a second. by loop it's over a minute.
For example for the first row (with index 74729) the bigger is going to be 74731. I know it can be done by linq in C# but I'm almost new in python.
here is another example:
here is text version:
index bsl mb bigger
74729 47091.89 47160.00 74731.0
74730 47159.00 47201.00 74735.0
74731 47196.50 47201.50 74735.0
74732 47186.50 47198.02 74735.0
74733 47191.50 47191.50 74735.0
74734 47162.50 47254.00 74736.0
74735 47252.50 47411.50 74736.0
74736 47414.50 47421.00 74747.0
74737 47368.50 47403.00 74742.0
74738 47305.00 47310.00 74742.0
74739 47292.00 47320.00 74742.0
74740 47302.00 47374.00 74742.0
74741 47291.47 47442.50 74899.0
74742 47403.50 47416.50 74746.0
74743 47354.34 47362.50 74746.0
I'm not sure how many rows you have, but if the number is reasonable, you can perform a pairwise comparison:
# get data as arrays
a = df['bsl'].to_numpy()
b = df['mb'].to_numpy()
idx = df.index.to_numpy()
# compare values and mask lower triangle
# to ensure comparing only the greater indices
out = np.triu(a>b[:,None]).argmax(1).astype(float)
# reindex to original indices
idx = idx[out]
# mask invalid indices
idx[out<np.arange(len(out))] = np.nan
df['bigger'] = idx
Output:
bsl mb bigger
0 1 2 2.0
1 2 4 6.0
2 3 3 5.0
3 2 1 3.0
4 3 5 NaN
5 4 2 5.0
6 5 1 6.0
7 1 0 7.0

I need to create a dataframe were values reference previous rows

I am just starting to use python and im trying to learn some of the general things about it. As I was playing around with it I wanted to see if I could make a dataframe that shows a starting number which is compounded by a return. Sorry if this description doesnt make much sense but I basically want a dataframe x long that shows me:
number*(return)^(row number) in each row
so for example say number is 10 and the return is 10% so i would like for the dataframe to give me the series
1 11
2 12.1
3 13.3
4 14.6
5 ...
6 ...
Thanks so much in advanced!
Let us try
import numpy as np
val = 10
det = 0.1
n = 4
out = 10*((1+det)**np.arange(n))
s = pd.Series(out)
s
Out[426]:
0 10.00
1 11.00
2 12.10
3 13.31
dtype: float64
Notice here I am using the index from 0 , since 1.1**0 will yield the original value
I think this does what you want:
df = pd.DataFrame({'returns': [x for x in range(1, 10)]})
df.index = df.index + 1
df.returns = df.returns.apply(lambda x: (10 * (1.1**x)))
print(df)
Out:
returns
1 11.000000
2 12.100000
3 13.310000
4 14.641000
5 16.105100
6 17.715610
7 19.487171
8 21.435888
9 23.579477

Filter pandas rows based on trend in values

I have the following table in pandas:
Client
Evol_2019
Evol_2020
Evol_2021
Juice Factory
0
1
-1
Food Factory
-1
0
-2
Cloth Factory
2
0
0
I would like to display only rows which observe a trend over the years, that is to say which have multiple negative or positive values in their evolutions. In that case, only Food Factory would be displayed.
I didn't find any conditions satisfying this requirement, any idea ?
Assuming your "trend" is that all values are smaller or equal to zero, you could do this:
df = df[(df <= 0).all(1)]
Consider this example
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 3))
that is,
0 1 2
0 -0.888926 0.058545 -1.256491
1 0.477024 -2.519239 -0.110326
2 1.884556 0.714018 0.977505
3 0.132514 -1.374656 -0.727327
4 0.219045 0.354403 -0.183413
5 0.343402 -0.302415 -0.372308
6 -0.699375 -0.492723 0.694994
7 1.460814 -0.294340 -1.305795
8 -0.177625 0.499749 0.001147
9 -0.575742 -0.148443 -1.766909
then
df[(df <= 0).all(1)
0 1 2
9 -0.575742 -0.148443 -1.766909
However, you need to be more precise with what you mean by trend.

Merging two dataframes based on index

I've been on this all night, and just can't figure it out, even though I know it should be simple. So, my sincerest apologies for the following incantation from a sleep-deprived fellow:
So, I have four fields, Employee ID, Name, Station and Shift (ID is non-null integer, the rest are strings or null).
I have about 10 dataframes, all indexed by ID. And each containing only two columns either (Name and Station) or (Name and Shift)
Now of course, I want to combine all of this into one dataframe, which has a unique row for each ID.
But I'm really frustrated by it at this point(especially because I can't find a way to directly check how many unique indices my final dataframe ends with)
After messing around with some very ugly ways of using .merge(), I finally found .concat(). But it keeps making multiple rows per ID, when I check in excel, the indices are like Table1/1234, Table2/1234 etc. One row has the shift, the other one has station, which is precisely what I'm trying to avoid.
How do I compile all my data into one dataframe, having exactly one row per ID? Possibly without using 9 different merge statements, as I have to scale up later.
If I understand your question correctly, this is the thing that you want.
For example with this 3 dataframes..
In [1]: df1
Out[1]:
0 1 2
0 3.588843 3.566220 6.518865
1 7.585399 4.269357 4.781765
2 9.242681 7.228869 5.680521
3 3.600121 3.931781 4.616634
4 9.830029 9.177663 9.842953
5 2.738782 3.767870 0.925619
6 0.084544 6.677092 1.983105
7 5.229042 4.729659 8.638492
8 8.575547 6.453765 6.055660
9 4.386650 5.547295 8.475186
In [2]: df2
Out[2]:
0 1
0 95.013170 90.382886
2 1.317641 29.600709
4 89.908139 21.391058
6 31.233153 3.902560
8 17.186079 94.768480
In [3]: df
Out[3]:
0 1 2
0 0.777689 0.357484 0.753773
1 0.271929 0.571058 0.229887
2 0.417618 0.310950 0.450400
3 0.682350 0.364849 0.933218
4 0.738438 0.086243 0.397642
5 0.237481 0.051303 0.083431
6 0.543061 0.644624 0.288698
7 0.118142 0.536156 0.098139
8 0.892830 0.080694 0.084702
9 0.073194 0.462129 0.015707
You can do
pd.concat([df,df1,df2], axis=1)
This produces
In [6]: pd.concat([df,df1,df2], axis=1)
Out[6]:
0 1 2 0 1 2 0 1
0 0.777689 0.357484 0.753773 3.588843 3.566220 6.518865 95.013170 90.382886
1 0.271929 0.571058 0.229887 7.585399 4.269357 4.781765 NaN NaN
2 0.417618 0.310950 0.450400 9.242681 7.228869 5.680521 1.317641 29.600709
3 0.682350 0.364849 0.933218 3.600121 3.931781 4.616634 NaN NaN
4 0.738438 0.086243 0.397642 9.830029 9.177663 9.842953 89.908139 21.391058
5 0.237481 0.051303 0.083431 2.738782 3.767870 0.925619 NaN NaN
6 0.543061 0.644624 0.288698 0.084544 6.677092 1.983105 31.233153 3.902560
7 0.118142 0.536156 0.098139 5.229042 4.729659 8.638492 NaN NaN
8 0.892830 0.080694 0.084702 8.575547 6.453765 6.055660 17.186079 94.768480
9 0.073194 0.462129 0.015707 4.386650 5.547295 8.475186 NaN NaN
For more details you might want to see pd.concat
Just a tip putting simple illustrative data in your question always helps in getting answer.

Pandas Vectorization with Function on Parts of Column

So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
0 1 2
0 1 2 3
1 5 7 8
2 2 5 4
I then have a function that adds 5 to a number called add5. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization not apply as this concept is going to be expanded to a dataset with hundreds of thousands of entries and speed will be important. I can do it without the greater than 3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
0 1 2 3
0 1 2 3 3
1 5 7 8 13
2 2 5 4 9

Categories

Resources