Sort pandas dataframe column based on substring - python

I have a pandas dataframe, as shown below:
Timestamp_Start Event_ID Duration
555.54944 Fix_1 0.42248
559.07281 Fix_10 0.01996
559.14642 Fix_11 0
556.03192 Fix_2 0.16113
556.27985 Fix_3 0.24188
556.56097 Fix_4 0.04987
556.65497 Fix_5 0.10748
556.80859 Fix_6 0.75708
557.57983 Fix_7 0.11329
557.75348 Fix_8 0.65643
558.43665 Fix_9 0.27447
555.97925 Sac_1 0.04577
559.09961 Sac_10 0.0404
559.15302 Sac_11 0.00726
556.19916 Sac_2 0.07403
556.52747 Sac_3 0.02789
556.61865 Sac_4 0.02985
556.76849 Sac_5 0.0337
557.57294 Sac_6 0
557.69965 Sac_7 0.04687
558.41632 Sac_8 0.01325
558.71796 Sac_9 0.34552
I want to sort the 'Event_ID' column so that Fix_1, Fix_2, Fix_3, ... and Sac_1, Sac_2, Sac_3, ... appear in order, like below:
Timestamp_Start Event_ID Duration
555.54944 Fix_1 0.42248
556.03192 Fix_2 0.16113
556.27985 Fix_3 0.24188
556.56097 Fix_4 0.04987
556.65497 Fix_5 0.10748
556.80859 Fix_6 0.75708
557.57983 Fix_7 0.11329
557.75348 Fix_8 0.65643
558.43665 Fix_9 0.27447
559.07281 Fix_10 0.01996
559.14642 Fix_11 0
555.97925 Sac_1 0.04577
556.19916 Sac_2 0.07403
556.52747 Sac_3 0.02789
556.61865 Sac_4 0.02985
556.76849 Sac_5 0.0337
557.57294 Sac_6 0
557.69965 Sac_7 0.04687
558.41632 Sac_8 0.01325
558.71796 Sac_9 0.34552
559.09961 Sac_10 0.0404
559.15302 Sac_11 0.00726
Any ideas on how to do that? Thanks for your help.

One way using distutils.version:
import numpy as np
from distutils.version import LooseVersion
f = np.vectorize(LooseVersion)
new_df = df.sort_values("Event_ID", key=f)
print(new_df)
Output:
Timestamp_Start Event_ID Duration
0 555.54944 Fix_1 0.42248
3 556.03192 Fix_2 0.16113
4 556.27985 Fix_3 0.24188
5 556.56097 Fix_4 0.04987
6 556.65497 Fix_5 0.10748
7 556.80859 Fix_6 0.75708
8 557.57983 Fix_7 0.11329
9 557.75348 Fix_8 0.65643
10 558.43665 Fix_9 0.27447
1 559.07281 Fix_10 0.01996
2 559.14642 Fix_11 0.00000
11 555.97925 Sac_1 0.04577
14 556.19916 Sac_2 0.07403
15 556.52747 Sac_3 0.02789
16 556.61865 Sac_4 0.02985
17 556.76849 Sac_5 0.03370
18 557.57294 Sac_6 0.00000
19 557.69965 Sac_7 0.04687
20 558.41632 Sac_8 0.01325
21 558.71796 Sac_9 0.34552
12 559.09961 Sac_10 0.04040
13 559.15302 Sac_11 0.00726
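Note that distutils is deprecated and has been removed in Python 3.12, so on newer interpreters you may prefer a key that does not depend on it. A minimal sketch (assuming pandas >= 1.1 for the key argument of sort_values, and that each Event_ID contains exactly one run of digits) is to zero-pad the numeric suffix so that plain string sorting puts Fix_2 before Fix_10:
new_df = df.sort_values(
    "Event_ID",
    # pad the digits to a fixed width, e.g. Fix_2 -> Fix_00002, Fix_10 -> Fix_00010
    key=lambda s: s.str.replace(r"\d+", lambda m: m.group().zfill(5), regex=True),
)
print(new_df)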

Normal sorting on the dataframe will not work, because the integer in the string needs to be treated as an int value.
It can be done with a little extra space, though.
You can make two helper columns like this:
df['event'] = df.Event_ID.str.rsplit("_").str[0]
df['idx'] = df.Event_ID.str.rsplit("_").str[-1].astype(int)
Now, sort on these two columns,
df.sort_values(['event', 'idx'])
Timestamp_Start Event_ID Duration idx event
0 555.54944 Fix_1 0.42248 1 Fix
3 556.03192 Fix_2 0.16113 2 Fix
4 556.27985 Fix_3 0.24188 3 Fix
5 556.56097 Fix_4 0.04987 4 Fix
6 556.65497 Fix_5 0.10748 5 Fix
7 556.80859 Fix_6 0.75708 6 Fix
8 557.57983 Fix_7 0.11329 7 Fix
9 557.75348 Fix_8 0.65643 8 Fix
10 558.43665 Fix_9 0.27447 9 Fix
1 559.07281 Fix_10 0.01996 10 Fix
2 559.14642 Fix_11 0.00000 11 Fix
11 555.97925 Sac_1 0.04577 1 Sac
14 556.19916 Sac_2 0.07403 2 Sac
15 556.52747 Sac_3 0.02789 3 Sac
16 556.61865 Sac_4 0.02985 4 Sac
17 556.76849 Sac_5 0.03370 5 Sac
18 557.57294 Sac_6 0.00000 6 Sac
19 557.69965 Sac_7 0.04687 7 Sac
20 558.41632 Sac_8 0.01325 8 Sac
21 558.71796 Sac_9 0.34552 9 Sac
12 559.09961 Sac_10 0.04040 10 Sac
13 559.15302 Sac_11 0.00726 11 Sac
You can then reset_index and drop the helper columns as needed.
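A minimal sketch of that cleanup, continuing from the two helper columns created above:
out = (
    df.sort_values(["event", "idx"])   # order by prefix, then by numeric suffix
      .drop(columns=["event", "idx"])  # remove the helper columns again
      .reset_index(drop=True)          # restore a clean 0..n-1 index
)
print(out)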

Related

How to Group by the mean of specific columns in Python

In the dataframe below:
import pandas as pd
import numpy as np
df= {
'Gen':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
'Site':['FRX','FX','FRX','FRX','FRX','FX','FRX','FX','FX','FX','FX','FRX','FRX','FRX','FRX','FRX'],
'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
'AIC':['<1','<1','<1','<1',1,1,1,1,2,2,2,2,'>2','>2','>2','>2'],
'AIC_TRX':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
'Grwth_Time1':[150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,244.73,269.91,249.19],
'Grwth_Time2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
'Grwth_Time3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
'Grwth_Time5':[25.78,22.34,28.53,27.69,30.07,17.7,29.81,33.15,34.87,32.54,36.59,39.97,29.28,34.73,36.91,34.12],
'Grwth_Time6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
}
df = pd.DataFrame(df,columns = ['Gen','Site','Type','AIC','AIC_TRX','diff','series','Grwth_Time1','Grwth_Time2','Grwth_Time3','Grwth_Time4','Grwth_Time5','Grwth_Time6','Grwth_Time7'])
df.info()
I want to do the following:
Find the average of each unique series per AIC_TRX for each Grwth_Time (Grwth_Time1, Grwth_Time2,....,Grwth_Time7)
Export all the outputs as one xlsx file (refer to the figure below)
The desired outputs look like the figure below (note: the numbers in this output are not the actual average values, they were randomly generated)
My attempt:
# Select the columns -> AIC_TRX, series, Grwth_Time1,Grwth_Time2,....,Grwth_Time7
df1 = df[['AIC_TRX', 'diff', 'series',
          'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3', 'Grwth_Time4',
          'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']]
#Below is where I need help, I want to groupby the 'series' and 'AIC_TRX' for all the 'Grwth_Time1_to_7'
df1.groupby('series').Grwth_Time1.agg(['mean'])
Thanks in advance
You have to group by two columns, ['series', 'AIC_TRX'], and take the mean of each Grwth_Time column.
df.groupby(['series', 'AIC_TRX'])[['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
                                   'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6',
                                   'Grwth_Time7']].mean().unstack().to_excel("output.xlsx")
Output:
AIC_TRX 1 2 3 4
series
1 150.78 208.07 146.87 229.28
2 162.34 217.76 182.54 244.73
4 188.53 229.48 189.57 269.91
8 197.69 139.51 199.97 249.19
AIC_TRX 1 2 3 4
series
1 250.78 308.07 346.87 329.28
2 262.34 317.70 382.54 347.73
4 288.53 329.81 369.59 369.91
8 297.69 339.15 399.97 349.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 270.84 318.73 398.75 494.85
2 282.14 327.47 432.18 509.39
4 298.53 369.63 449.78 515.52
8 306.69 389.59 473.55 539.23
AIC_TRX 1 2 3 4
series
1 25.78 30.07 34.87 29.28
2 22.34 17.70 32.54 34.73
4 28.53 29.81 36.59 36.91
8 27.69 33.15 39.97 34.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 27.84 18.73 38.75 13.85
2 28.14 27.47 24.18 9.39
4 29.53 36.63 24.78 15.52
8 30.69 38.59 21.55 39.23
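If, as in the figure you describe, each Grwth_Time should land on its own sheet rather than side by side, a sketch using pd.ExcelWriter could look like the following (the sheet names are just illustrative):
time_cols = ['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3', 'Grwth_Time4',
             'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']
means = df.groupby(['series', 'AIC_TRX'])[time_cols].mean()

with pd.ExcelWriter("output.xlsx") as writer:
    for col in time_cols:
        # one series x AIC_TRX table of means per sheet
        means[col].unstack().to_excel(writer, sheet_name=col)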
Alternatively, you can use apply to take the mean across the columns of each row within each series and AIC_TRX group.
result = df1.groupby(['series', 'AIC_TRX']).apply(np.mean, axis=1)
Result:
series AIC_TRX
1 1 0 120.738
2 4 156.281
3 8 170.285
4 12 196.270
2 1 1 122.358
2 5 152.758
3 9 184.494
4 13 205.175
4 1 2 135.471
2 6 171.968
3 10 187.825
4 14 214.907
8 1 3 142.183
2 7 162.849
3 11 196.851
4 15 216.455
dtype: float64

How to add another category in a DataFrame in python/pandas including only missing values?

I have a dataframe with two columns: 'TotalCharges', and 'Churn' with 7043 rows. In 11 cells of column 'TotalCharges' I have a missing value. What I want is to create 10 categories of TotalCharges plus one category called "MissingValues", but I can't find a way to do it. My DataFrame looks like this:
TotalCharges Churn
0 29.85 No
1 1889.5 No
2 108.15 Yes
3 1840.75 No
4 151.65 Yes
5 820.5 Yes
6 1949.4 No
7 301.9 No
8 3046.05 Yes
9 3487.95 No
10 587.45 No
11 326.8 No
12 5681.1 No
13 5036.3 Yes
14 2686.05 No
15 7895.15 No
16 missing No
17 7382.25 No
18 528.35 Yes
.... ....
.... ....
and I want to get something like this:
TotalCharges Churn TotalChargesCategories
0 29.85 No (18.799, 84.61]
1 1889.5 No (947.38, 1400.55]
2 108.15 Yes (84.61, 267.37]
3 1840.75 No (1400.55, 2065.52]
4 151.65 Yes (84.61, 267.37]
5 820.5 Yes (552.82, 947.38]
6 1949.4 No (1400.55, 2065.52]
7 301.9 No (267.37, 552.82]
8 3046.05 Yes (2065.52, 3132.75]
9 3487.95 No (3132.75, 4471.44]
10 587.45 No (552.82, 947.38]
11 326.8 No (267.37, 552.82]
12 5681.1 No (4471.44, 5973.69]
13 5036.3 Yes (4471.44, 5973.69]
14 2686.05 No (2065.52, 3132.75]
15 7895.15 No (5973.69, 8684.8]
16 missing No MissingValues
17 7382.25 No (5973.69, 8684.8]
18 528.35 Yes (267.37, 552.82]
.... ....
.... ....
If there were no missing values, it would be easy with this code:
width_bin = (pd.qcut(df.TotalCharges,10))
df = df.assign(TotalChargesCat=width_bin)
df
but since there are 11 missing values I have problems creating categories, and this code leads to the error message:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Simply force the missing values to NaN (either by explicit replacement or by coercing to a numeric dtype), and then use cut as you had:
df['TotalChargesCategories'] = pd.cut(pd.to_numeric(df['TotalCharges'], errors='coerce'),10)
>>> df
TotalCharges Churn TotalChargesCategories
0 29.85 No (21.985, 816.38]
1 1889.5 No (1602.91, 2389.44]
2 108.15 Yes (21.985, 816.38]
3 1840.75 No (1602.91, 2389.44]
4 151.65 Yes (21.985, 816.38]
5 820.5 Yes (816.38, 1602.91]
6 1949.4 No (1602.91, 2389.44]
7 301.9 No (21.985, 816.38]
8 3046.05 Yes (2389.44, 3175.97]
9 3487.95 No (3175.97, 3962.5]
10 587.45 No (21.985, 816.38]
11 326.8 No (21.985, 816.38]
12 5681.1 No (5535.56, 6322.09]
13 5036.3 Yes (4749.03, 5535.56]
14 2686.05 No (2389.44, 3175.97]
15 7895.15 No (7108.62, 7895.15]
16 missing No NaN
17 7382.25 No (7108.62, 7895.15]
18 528.35 Yes (21.985, 816.38]
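If you also want the missing rows labelled explicitly as "MissingValues", and the 10 quantile bins from your original qcut attempt, a sketch along these lines should work, assuming the missing cells hold a non-numeric placeholder such as "missing" or whitespace:
charges = pd.to_numeric(df['TotalCharges'], errors='coerce')   # placeholders become NaN
cats = pd.qcut(charges, 10)                                    # 10 quantile-based bins
df['TotalChargesCategories'] = (
    cats.cat.add_categories('MissingValues')   # register the extra category
        .fillna('MissingValues')               # and assign it to the NaN rows
)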

Scale values of a particular column of python dataframe between 1-10

I have a dataframe which contains YouTube video views, and I want to scale these values to the range 1-10.
Below is a sample of how the values look. How do I normalize them to the range 1-10, or is there a more efficient way to do this?
rating
4394029
274358
473691
282858
703750
255967
3298456
136643
796896
2932
220661
48688
4661584
2526119
332176
7189818
322896
188162
157437
1153128
788310
1307902
One possibility is to scale by the maximum:
1 + df / df.max() * 9
rating
0 6.500315
1 1.343433
2 1.592952
3 1.354073
4 1.880933
5 1.320412
6 5.128909
7 1.171046
8 1.997531
9 1.003670
10 1.276217
11 1.060946
12 6.835232
13 4.162121
14 1.415808
15 10.000000
16 1.404192
17 1.235536
18 1.197075
19 2.443451
20 1.986783
21 2.637193
Similar solution by Wen (now deleted):
1 + (df - df.min()) * 9 / (df.max() - df.min())
rating
0 6.498887
1 1.339902
2 1.589522
3 1.350546
4 1.877621
5 1.316871
6 5.126922
7 1.167444
8 1.994266
9 1.000000
10 1.272658
11 1.057299
12 6.833941
13 4.159739
14 1.412306
15 10.000000
16 1.400685
17 1.231960
18 1.193484
19 2.440368
20 1.983514
21 2.634189
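The second formula is a standard min-max rescaling: the smallest view count maps to 1 and the largest to 10. A small sketch wrapping it in a reusable helper (the function name is illustrative, and the degenerate all-equal case is clamped to the lower bound):
def rescale(s, lo=1.0, hi=10.0):
    """Linearly map the values of s so that s.min() -> lo and s.max() -> hi."""
    rng = s.max() - s.min()
    if rng == 0:                       # all values identical
        return s * 0 + lo
    return lo + (s - s.min()) * (hi - lo) / rng

df['rating_scaled'] = rescale(df['rating'])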

Break Existing Dataframe Apart Based on Multi Index

I have an existing dataframe that is sorted like this:
In [3]: result_GB_daily_average
Out[3]:
NREL Avert
Month Day
1 1 14.718417 37.250000
2 40.381167 45.250000
3 42.512646 40.666667
4 12.166896 31.583333
5 14.583208 50.416667
6 34.238000 45.333333
7 45.581229 29.125000
8 60.548479 27.916667
9 48.061583 34.041667
10 20.606958 37.583333
11 5.418833 70.833333
12 51.261375 43.208333
13 21.796771 42.541667
14 27.118979 41.958333
15 8.230542 43.625000
16 14.233958 48.708333
17 28.345875 51.125000
18 43.896375 55.500000
19 95.800542 44.500000
20 53.763104 39.958333
21 26.171437 50.958333
22 20.372688 66.916667
23 20.594042 42.541667
24 16.889083 48.083333
25 16.416479 42.125000
26 28.459625 40.125000
27 1.055229 49.833333
28 36.798792 42.791667
29 27.260083 47.041667
30 23.584917 55.750000
... ... ...
12 2 34.491604 55.916667
3 26.444333 53.458333
4 15.088333 45.000000
5 10.213500 32.083333
6 19.087688 17.000000
7 23.078292 17.375000
8 41.523667 29.458333
9 17.173854 37.833333
10 11.488687 52.541667
11 15.203479 30.000000
12 8.390917 37.666667
13 70.067062 23.458333
14 24.281729 25.583333
15 31.826104 33.458333
16 5.085271 42.916667
17 3.778229 46.916667
18 31.276958 57.625000
19 7.399458 46.916667
20 18.531958 39.291667
21 26.831937 35.958333
22 55.514000 32.375000
23 24.018875 34.041667
24 54.454125 43.083333
25 57.379812 25.250000
26 94.520833 33.958333
27 49.693854 27.500000
28 2.406438 46.916667
29 7.133833 53.916667
30 7.829167 51.500000
31 5.584646 55.791667
I would like to split this dataframe apart into 12 different dataframes, one for each month, but the problem is that they are all slightly different lengths because the number of days in a month varies, so attempts at using np.array_split have failed. How can I split this based on the Month index?
One solution:
df=result_GB_daily_average
[df.iloc[df.index.get_level_values('Month')==i+1] for i in range(12)]
or, shorter (note that .ix has been removed from modern pandas, so use .loc instead, and the month labels run from 1 to 12):
[df.loc[m] for m in range(1, 13)]
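Another option (a small sketch) is to let groupby split on the first index level, which avoids hard-coding the number of months:
# dict mapping month number -> sub-DataFrame, built from the 'Month' index level
monthly = {month: sub for month, sub in df.groupby(level='Month')}
# monthly[1] is the January frame, monthly[12] the December frame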

Nested if loop with DataFrame is very,very slow

I have 10 million rows to go through, and it will take many hours to process; I must be doing something wrong.
I assigned the df columns to shorter variable names for ease of typing:
Close=df['Close']
eqId=df['eqId']
date=df['date']
IntDate=df['IntDate']
expiry=df['expiry']
delta=df['delta']
ivMid=df['ivMid']
conf=df['conf']
The code below works fine, it is just ungodly slow. Any suggestions?
print(datetime.datetime.now().time())
for i in range(2,1000):
    if delta[i]==90:
        if delta[i-1]==50:
            if delta[i-2]==10:
                if expiry[i]==expiry[i-2]:
                    df.Skew[i]=ivMid[i]-ivMid[i-2]
print(datetime.datetime.now().time())
14:02:11.014396
14:02:13.834275
df.head(100)
Close eqId date IntDate expiry delta ivMid conf Skew
0 37.380005 7 2008-01-02 39447 1 50 0.3850 0.8663
1 37.380005 7 2008-01-02 39447 1 90 0.5053 0.7876
2 36.960007 7 2008-01-03 39448 1 50 0.3915 0.8597
3 36.960007 7 2008-01-03 39448 1 90 0.5119 0.7438
4 35.179993 7 2008-01-04 39449 1 50 0.4055 0.8454
5 35.179993 7 2008-01-04 39449 1 90 0.5183 0.7736
6 33.899994 7 2008-01-07 39452 1 50 0.4464 0.8400
7 33.899994 7 2008-01-07 39452 1 90 0.5230 0.7514
8 31.250000 7 2008-01-08 39453 1 10 0.4453 0.7086
9 31.250000 7 2008-01-08 39453 1 50 0.4826 0.8246
10 31.250000 7 2008-01-08 39453 1 90 0.5668 0.6474 0.1215
11 30.830002 7 2008-01-09 39454 1 10 0.4716 0.7186
12 30.830002 7 2008-01-09 39454 1 50 0.4963 0.8479
13 30.830002 7 2008-01-09 39454 1 90 0.5735 0.6704 0.1019
14 31.460007 7 2008-01-10 39455 1 10 0.4254 0.6737
15 31.460007 7 2008-01-10 39455 1 50 0.4929 0.8218
16 31.460007 7 2008-01-10 39455 1 90 0.5902 0.6411 0.1648
17 30.699997 7 2008-01-11 39456 1 10 0.4868 0.7183
18 30.699997 7 2008-01-11 39456 1 50 0.4965 0.8411
19 30.639999 7 2008-01-14 39459 1 10 0.5117 0.7620
20 30.639999 7 2008-01-14 39459 1 50 0.4989 0.8804
21 30.639999 7 2008-01-14 39459 1 90 0.5887 0.6845 0.077
22 29.309998 7 2008-01-15 39460 1 10 0.4956 0.7363
23 29.309998 7 2008-01-15 39460 1 50 0.5054 0.8643
24 30.080002 7 2008-01-16 39461 1 10 0.4983 0.6646
At this rate it will take 7.77 hrs to process
Basically, the whole point of numpy & pandas is to avoid loops like the plague, and do things in a vectorial way. As you noticed, without that, speed is gone.
Let's break your problem into steps.
The Conditions
Here, your first condition can be written like this:
df.delta == 90
(Note how this compares the entire column at once; this is much, much faster than your loop!)
and the second one can be written like this (using shift):
df.delta.shift(1) == 50
The rest of your conditions are similar.
Note that to combine conditions, you need to use parentheses. So, the first two conditions, together, should be written as:
(df.delta == 90) & (df.delta.shift(1) == 50)
You should now be able to write an expression combining all your conditions. Let's call it cond, i.e.,
cond = (df.delta == 90) & (df.delta.shift(1) == 50) & ...
The assignment
To assign things to a new column, use
df['skew'] = ...
We just need to figure out what to put on the right-hand side.
The Right Hand Side
Since we have cond, we can write the right-hand-side as
np.where(cond, df.ivMid - df.ivMid.shift(2), 0)
What this says is: when the condition is true, take the second term; when it's not, take the third term (in this case I used 0, but use whatever you like).
By combining all of this, you should be able to write a very efficient version of your code.
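Putting it all together, a sketch of the full vectorized replacement for the loop (using np.nan instead of 0 for the non-matching rows, to match the blanks in your sample output) might look like:
cond = (
    (df.delta == 90)
    & (df.delta.shift(1) == 50)
    & (df.delta.shift(2) == 10)
    & (df.expiry == df.expiry.shift(2))
)
# where the 10/50/90 pattern holds within one expiry, store the skew; otherwise NaN
df['Skew'] = np.where(cond, df.ivMid - df.ivMid.shift(2), np.nan)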
