Pandas set start and end based on consecutive category - python

There are similar questions on Stack Overflow, but none that quite address this, and I can't really figure this one out. I have a pandas DataFrame that looks like this:
Account Size
------------------
11120011 0
11130212 0
21023123 1
22109832 2
28891902 2
33390909 0
34123495 0
34490909 0
And for all the accounts that have size == 0 I'd like to collapse them like so:
Account Start Size Account End
---------------------------------------
11120011 0 11130212
21023123 1 21023123
22109832 2 22109832
28891902 2 28891902
33390909 0 34490909
The accounts with size != 0 can just repeat in both columns, but for the ones with size == 0 I'd just like to keep the beginning and end of that particular segment. The df is already sorted on Account.
Help is appreciated. Thanks.

IIUC, use diff + cumsum to create the group key, then do agg:
m1 = df.Size.diff().ne(0)
m2 = df.Size.ne(0)
df.groupby((m1 | m2).cumsum()).agg({'Account': ['first', 'last'], 'Size': 'first'})
Out[97]:
      Size   Account
     first     first      last
Size
1        0  11120011  11130212
2        1  21023123  21023123
3        2  22109832  22109832
4        2  28891902  28891902
5        0  33390909  34490909
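If you also want flat single-level columns in the Account Start / Size / Account End layout from the question, here is a possible cleanup step (not part of the original answer): the same grouping key, spelled with named aggregation (pandas >= 0.25) plus a rename.
out = (df.groupby((m1 | m2).cumsum())
         .agg(first_account=('Account', 'first'),
              first_size=('Size', 'first'),
              last_account=('Account', 'last'))
         .reset_index(drop=True))
out.columns = ['Account Start', 'Size', 'Account End']
print(out)
This should reproduce the desired table above, with one row per size == 0 block and one row per non-zero account.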

Late to the party but I think this also works.
df['Account End'] = df.shift(-1)[(df.Size == 0)]['Account']
Still in the learning phase for pandas, so if this is bad for any reason, let me know. Thanks.

Related

Pandas MultiIndex sort by group

I would like to keep the counts in descending order, but I am unable to group the following frame by index level 0. The block with code 0512 should come together while keeping the counts in descending order within each code.
code product count
0510 あたたか新潟こしひかり 5kg           1
0511 キッコ−マン 味わいリッチ減塩しょうゆ 450ml 1
7プレミアム 国産果汁使用ゆずぽん酢 200ml  1
0512 キリン 生茶 525ml              1
キリンレモン 450ml              1
コカ・コーラ い・ろ・は・す もも 555ML   1
サントリー なっちゃん オレンジ 425ml    1
サントリー プレミアムボス ブラック 490ml  2
サントリー 天然水南アルプス 2L ケース     1
サントリー 天然水南アルプス 2L ペット     1
サントリー 朝摘みオレンジ&天然水 540ml   1
大塚 ポカリスエット 900ML ペット      1
森永 inゼリー エネルギーレモン 180g    1
綾鷹 525MLペット               2
7プレミアム パイナップルサイダー 500ml   1
7プレミアム フルーツオ・レ 500ml      1
GAクラフトマン ダークモカ 440ml      1
UCC 職人の珈琲 無糖 930ML ペット    1
0513 アサヒ オフ 500ml×6            1
キリン 本麒麟 500ml             1
万上 濃厚熟成本みりん 1L            1
東村山純米酒 720ml              1
0514 ブルボン プチポテトコンソメ味 45g       1
ロッテ ガーナローストミルク 50g        1
ロッテ グリーンガム 9枚             1
My code
data = df.groupby(['code','product']).size().reset_index(name='counts').set_index(['code','product'])
data1 = data.sort_values(by=['counts','code'], ascending=False).groupby(['product','code']).sum()
EDIT:
I can see that the second groupby puts the codes together but messes up the descending order of counts per code, as we can see for 0512.
You should pass a list to the ascending argument in the second line, like this:
data1 = data.sort_values(by=['counts','code'], ascending=[False,False]).groupby(['product','code']).sum()
Otherwise, it takes the default value, which is True, for the "code" column.
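For reference, here is a minimal self-contained illustration of passing per-column sort directions as a list; the data is a toy stand-in with placeholder product names, not the table above.
import pandas as pd

# Toy stand-in for the product table above (placeholder names).
df = pd.DataFrame({'code':    ['0512', '0512', '0511', '0513', '0512'],
                   'product': ['tea', 'tea', 'soy sauce', 'beer', 'lemon soda']})

data = (df.groupby(['code', 'product']).size()
          .reset_index(name='counts')
          .set_index(['code', 'product']))

# One boolean per entry in `by`: counts descending, then code descending.
data1 = data.sort_values(by=['counts', 'code'], ascending=[False, False])
print(data1)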

Filter pandas rows based on trend in values

I have the following table in pandas:
Client         Evol_2019  Evol_2020  Evol_2021
Juice Factory          0          1         -1
Food Factory          -1          0         -2
Cloth Factory          2          0          0
I would like to display only the rows that show a trend over the years, that is to say, the rows that have multiple negative or positive values in their evolution. In that case, only Food Factory would be displayed.
I didn't find any condition satisfying this requirement. Any ideas?
Assuming your "trend" means that all values are less than or equal to zero, you could do this:
df = df[(df <= 0).all(1)]
Consider this example
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3))
that is,
0 1 2
0 -0.888926 0.058545 -1.256491
1 0.477024 -2.519239 -0.110326
2 1.884556 0.714018 0.977505
3 0.132514 -1.374656 -0.727327
4 0.219045 0.354403 -0.183413
5 0.343402 -0.302415 -0.372308
6 -0.699375 -0.492723 0.694994
7 1.460814 -0.294340 -1.305795
8 -0.177625 0.499749 0.001147
9 -0.575742 -0.148443 -1.766909
then
df[(df <= 0).all(1)]
0 1 2
9 -0.575742 -0.148443 -1.766909
However, you need to be more precise about what you mean by "trend".
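As a quick check under the same all-values-<=-0 reading of "trend" (and assuming Client is a regular column, which is a guess about the layout), applying the filter to the table in the question keeps only Food Factory:
import pandas as pd

# The asker's table rebuilt by hand.
df = pd.DataFrame({'Client': ['Juice Factory', 'Food Factory', 'Cloth Factory'],
                   'Evol_2019': [0, -1, 2],
                   'Evol_2020': [1, 0, 0],
                   'Evol_2021': [-1, -2, 0]})

# Apply the condition only to the numeric Evol_* columns.
evol = df[['Evol_2019', 'Evol_2020', 'Evol_2021']]
print(df[(evol <= 0).all(axis=1)])   # keeps only Food Factory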

How to use an OR condition in pandas to categorize my data

This might be a noob question, but I'm new to coding. I used the following code to categorize my data, but I need to specify that even if not all of my conditions are fulfilled together, e.g. only 4 out of 7 conditions hold, the row should still get the mentioned category. How can I do it? I really appreciate any help you can provide.
import numpy as np

c1 = df['Stroage Condition'].eq('refrigerate')
c2 = df['Profit Per Unit'].between(100, 150)
c3 = df['Inventory Qty'] < 20
df['Restock Action'] = np.where(c1 & c2 & c3, 'Hold Current stock level', 'On Sale')
print(df)
Let's say this is your DataFrame:
Stroage Condition refrigerate Profit Per Unit Inventory Qty
0 0 1 0 20
1 1 1 102 1
2 2 2 5 2
3 3 0 100 8
and the conditions are the ones you defined:
c1 = df['Stroage Condition'].eq(df['refrigerate'])
c2 = df['Profit Per Unit'].between(100, 150)
c3 = df['Inventory Qty'] < 20
Then you can define a small helper function and pass its result to np.where(). There you define how many conditions have to be True; in this example I require at least two.
def my_select(x, y, z):
    return np.array([x, y, z]).sum(axis=0) >= 2
Finally you run one more line:
df['Restock Action']=np.where(my_select(c1,c2,c3), 'Hold Current stock level', 'On Sale')
print(df)
This prints to the console:
Stroage Condition refrigerate Profit Per Unit Inventory Qty Restock Action
0 0 1 0 20 On Sale
1 1 1 102 1 Hold Current stock level
2 2 2 5 2 Hold Current stock level
3 3 0 100 8 Hold Current stock level
If you have more conditions or rules, you have to extend the function with as many variables as there are rules.
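If the list of rules keeps growing, one possible generalization (a sketch, not part of the original answer; it reuses the df, c1, c2 and c3 defined above) is a helper that accepts any list of boolean Series plus a threshold:
import numpy as np

def at_least_k(conditions, k):
    # Stack the boolean Series and count, per row, how many are True.
    return np.vstack([c.to_numpy() for c in conditions]).sum(axis=0) >= k

# Require at least 2 of the 3 rules above to hold.
df['Restock Action'] = np.where(at_least_k([c1, c2, c3], 2),
                                'Hold Current stock level', 'On Sale')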

Pandas reorder rows of dataframe

I stumbled upon a very peculiar problem in pandas. I have this DataFrame:
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,2.0,3
1,1600349033921620000,1,18.5371406,-14.224917,0,-0.0113912,1.443597,20,0.5,0.9,-1,7,2.0,3
2,1600349033921650000,2,19.808648100000006,-6.778450599999998,0,0.037289,-1.0557937,20,0.5,0.9,-1,7,2.0,3
3,1600349033921670000,3,22.1796988,-5.7078115999999985,0,0.2585675,-1.2431861000000002,20,0.5,0.9,-1,7,2.0,3
4,1600349033921670000,4,20.757325,-16.115366,0,-0.2528627,0.7889673,20,0.5,0.9,-1,7,2.0,3
5,1600349033921690000,5,20.9491012,-17.7806833,0,0.5062633,0.9386511,20,0.5,0.9,-1,7,2.0,3
6,1600349033921690000,6,20.6225258,-5.5344404,0,-0.1192678,-0.7889041,20,0.5,0.9,-1,7,2.0,3
7,1600349033921700000,7,21.8077004,-14.736984,0,-0.0295737,1.3084618,20,0.5,0.9,-1,7,2.0,3
8,1600349033954560000,0,23.206789800000006,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,2.0,3
9,1600349033954570000,1,18.555421300000006,-13.7440508,0,0.0548418,1.4426004,20,0.5,0.9,-1,7,2.0,3
10,1600349033954570000,2,19.8409748,-7.126075500000002,0,0.0969802,-1.0428747,20,0.5,0.9,-1,7,2.0,3
11,1600349033954580000,3,22.3263185,-5.9586202,0,0.4398591,-0.752425,20,0.5,0.9,-1,7,2.0,3
12,1600349033954590000,4,20.7154136,-15.842398800000002,0,-0.12573430000000002,0.8189016,20,0.5,0.9,-1,7,2.0,3
13,1600349033954590000,5,21.038901,-17.4111883,0,0.2693992,1.108485,20,0.5,0.9,-1,7,2.0,3
14,1600349033954600000,6,20.612499,-5.810969,0,-0.030080400000000007,-0.8295869,20,0.5,0.9,-1,7,2.0,3
15,1600349033954600000,7,21.7872537,-14.3011986,0,-0.0613401,1.3073578,20,0.5,0.9,-1,7,2.0,3
16,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
17,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
18,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
This is the input file.
Please note that id always runs from 0 up to 7 and then repeats, and the time column increases in sequential steps (which implies that the previous row's time should be smaller than or equal to the current one's).
I would like to reorder the rows of the DataFrame as shown below.
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.0,2
1,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.0,2
2,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.0,2
3,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,1
4,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,1
5,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,1
6,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
7,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
8,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
9,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,3
10,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,3
11,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,3
This is the desired result
Please note that I need to reorder the DataFrame rows based on these columns: id, time, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP, SIM.
As you can see from the desired result, the DataFrame needs to be reordered so that the time column goes from smallest to largest, and the same holds for the rest of the columns: id, SIM, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP.
I tried to sort by several columns without success. Moreover, I tried to use groupby but I failed.
Could you help me solve the problem? Any suggestions are welcome.
P.S.
I have pasted the DataFrame so it can be read easily with the clipboard function, in order to be easily reproducible.
I am attaching a pic as well.
What did you try when sorting by several columns? This works:
In [10]: df.sort_values(['id', 'time', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM'])
Out[10]:
Unnamed: 0 time id X Y theta Vx Vy ANGLE_FR DANGER_RAD RISK_RAD TTC_DAN_LOW TTC_DAN_UP TTC_STOP SIM
0 0 1600349033921610000 0 23.2644 -7.1409 0 0.0210 -1.1414 20 0.5 0.9 -1 7 2 3
8 8 1600349033954560000 0 23.2068 -7.5171 0 -0.1728 -1.1285 20 0.5 0.9 -1 7 2 3
1 1 1600349033921620000 1 18.5371 -14.2249 0 -0.0114 1.4436 20 0.5 0.9 -1 7 2 3
9 9 1600349033954570000 1 18.5554 -13.7441 0 0.0548 1.4426 20 0.5 0.9 -1 7 2 3
2 2 1600349033921650000 2 19.8086 -6.7785 0 0.0373 -1.0558 20 0.5 0.9 -1 7 2 3
How about this:
groupby_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM']
df = pd.concat([group for _, group in df.groupby(groupby_cols)]).reset_index(drop=True)
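Reading the desired output literally, the ordering appears to be id first, then the parameter columns, with time last within each block. Assuming that key order (an assumption, it is not stated explicitly in the question), a plain multi-column sort should reproduce it:
sort_cols = ['id', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD',
             'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM', 'time']
df = df.sort_values(sort_cols).reset_index(drop=True)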

How to avoid a loop over a df when accessing previous rows

I use pandas to process transport data. I study ridership of bus lines. I have 2 columns that count people getting on and off the bus at each stop, and I want to create one that counts the people currently on board. At the moment, I loop through the df, and for row n it does current[n] = on[n] - off[n] + current[n-1], as shown in the following example:
for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'current'] = df.loc[index, 'on']
    else:
        df.loc[index, 'current'] = df.loc[index, 'on'] - df.loc[index, 'off'] + df.loc[index - 1, 'current']
Is there a way to avoid using a loop?
Thanks for your time!
You can use Series.cumsum(), which accumulates the numbers in a given Series.
a = pd.DataFrame([[3,4],[6,4],[1,2],[4,5]], columns=["off", "on"])
a["current"] = a["on"].cumsum() - a["off"].cumsum()
off on current
0 3 4 1
1 6 4 -1
2 1 2 0
3 4 5 1
If I've understood the problem properly, you could calculate the difference between people getting on and off, then have a running total using Series.cumsum():
import pandas as pd
# Create dataframe for demo
d = {'Stop':['A','B','C','D'],'On':[3,2,3,2],'Off':[2,1,0,1]}
df = pd.DataFrame(data=d)
# Get difference between 'On' and 'Off' columns.
df['current'] = df['On']-df['Off']
# Get cumulative sum of column
df['Total'] = df['current'].cumsum()
# Same thing in one line
df['Total'] = (df['On']-df['Off']).cumsum()
Stop On Off Total
A 3 2 1
B 2 1 2
C 3 0 5
D 2 1 6
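One detail worth noting (an observation, not from either answer): the loop in the question does not subtract off at the first stop (current[0] = on[0]), while both cumsum versions above do. If that first-stop behaviour is intentional, it can still be reproduced without a loop; here is a sketch on the demo frame from the previous answer (the 'On'/'Off' column names are that answer's, not the original df's):
import pandas as pd

# Demo frame from the answer above.
d = {'Stop': ['A', 'B', 'C', 'D'], 'On': [3, 2, 3, 2], 'Off': [2, 1, 0, 1]}
df = pd.DataFrame(data=d)

# Zero out the first 'Off' value so the first stop only counts boardings,
# matching current[0] = on[0] in the question's loop, then take cumulative sums.
off = df['Off'].copy()
off.iloc[0] = 0
df['current'] = df['On'].cumsum() - off.cumsum()
print(df)   # current: 3, 4, 7, 8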
