Nested if loop with DataFrame is very, very slow - python

I have 10 million rows to go through and it will take many hours to process, so I must be doing something wrong.
I assigned my df columns to shorter variable names for ease of typing:
Close=df['Close']
eqId=df['eqId']
date=df['date']
IntDate=df['IntDate']
expiry=df['expiry']
delta=df['delta']
ivMid=df['ivMid']
conf=df['conf']
The code below works fine, it is just ungodly slow. Any suggestions?
print(datetime.datetime.now().time())
for i in range(2,1000):
    if delta[i]==90:
        if delta[i-1]==50:
            if delta[i-2]==10:
                if expiry[i]==expiry[i-2]:
                    df.Skew[i]=ivMid[i]-ivMid[i-2]
print(datetime.datetime.now().time())
14:02:11.014396
14:02:13.834275
df.head(100)
Close eqId date IntDate expiry delta ivMid conf Skew
0 37.380005 7 2008-01-02 39447 1 50 0.3850 0.8663 NaN
1 37.380005 7 2008-01-02 39447 1 90 0.5053 0.7876 NaN
2 36.960007 7 2008-01-03 39448 1 50 0.3915 0.8597 NaN
3 36.960007 7 2008-01-03 39448 1 90 0.5119 0.7438 NaN
4 35.179993 7 2008-01-04 39449 1 50 0.4055 0.8454 NaN
5 35.179993 7 2008-01-04 39449 1 90 0.5183 0.7736 NaN
6 33.899994 7 2008-01-07 39452 1 50 0.4464 0.8400 NaN
7 33.899994 7 2008-01-07 39452 1 90 0.5230 0.7514 NaN
8 31.250000 7 2008-01-08 39453 1 10 0.4453 0.7086 NaN
9 31.250000 7 2008-01-08 39453 1 50 0.4826 0.8246 NaN
10 31.250000 7 2008-01-08 39453 1 90 0.5668 0.6474 0.1215
11 30.830002 7 2008-01-09 39454 1 10 0.4716 0.7186 NaN
12 30.830002 7 2008-01-09 39454 1 50 0.4963 0.8479 NaN
13 30.830002 7 2008-01-09 39454 1 90 0.5735 0.6704 0.1019
14 31.460007 7 2008-01-10 39455 1 10 0.4254 0.6737 NaN
15 31.460007 7 2008-01-10 39455 1 50 0.4929 0.8218 NaN
16 31.460007 7 2008-01-10 39455 1 90 0.5902 0.6411 0.1648
17 30.699997 7 2008-01-11 39456 1 10 0.4868 0.7183 NaN
18 30.699997 7 2008-01-11 39456 1 50 0.4965 0.8411 NaN
19 30.639999 7 2008-01-14 39459 1 10 0.5117 0.7620 NaN
20 30.639999 7 2008-01-14 39459 1 50 0.4989 0.8804 NaN
21 30.639999 7 2008-01-14 39459 1 90 0.5887 0.6845 0.0770
22 29.309998 7 2008-01-15 39460 1 10 0.4956 0.7363 NaN
23 29.309998 7 2008-01-15 39460 1 50 0.5054 0.8643 NaN
24 30.080002 7 2008-01-16 39461 1 10 0.4983 0.6646 NaN
At this rate it will take 7.77 hrs to process

Basically, the whole point of numpy & pandas is to avoid loops like the plague and do things in a vectorized way. As you noticed, without that, speed is gone.
Let's break your problem into steps.
The Conditions
Here, your first condition can be written like this:
df.delta == 90
(Note how this compares the entire column at once. This is much much faster than your loop!).
and the second one can be written like this (using shift):
df.delta.shift(1) == 50
The rest of your conditions are similar.
Note that to combine conditions, you need to use parentheses. So, the first two conditions, together, should be written as:
(df.delta == 90) & (df.delta.shift(1) == 50)
You should now be able to write an expression combining all your conditions. Let's call it cond, i.e.,
cond = (df.delta == 90) & (df.delta.shift(1) == 50) & ...
The assignment
To assign things to a new column, use
df['skew'] = ...
We just need to figure out what to put on the right-hand side.
The Right Hand Side
Since we have cond, we can write the right-hand-side as
np.where(cond, df.ivMid - df.ivMid.shift(2), 0)
What this says is: when the condition is true, take the second term; when it's not, take the third term (in this case I used 0, but use whatever you like).
By combining all of this, you should be able to write a very efficient version of your code.
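Written out in full, the whole thing might look like this (a sketch following the steps above; np.nan is used instead of 0 so the non-matching rows stay blank, as in your printout):
import numpy as np

# all four conditions from the original loop, evaluated on whole columns at once
cond = ((df.delta == 90)
        & (df.delta.shift(1) == 50)
        & (df.delta.shift(2) == 10)
        & (df.expiry == df.expiry.shift(2)))

# where the condition holds, compute the skew; elsewhere leave NaN
df['Skew'] = np.where(cond, df.ivMid - df.ivMid.shift(2), np.nan)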

Related

fastest way to access dataframe cell by column values?

I have the following dataframe :
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function to get the pr_ss, for example for (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I've tried, but it takes a long time:
def get_safety_stock(df, bk1, bk2):
    ## a function that returns the safety stock for any given (bk1, bk2)
    for index, row in df.iterrows():
        if (row["bk1_lvl0_id"] == bk1) and (row["bk2_lvl0_id"] == bk2):
            return int(row["pr_ss"])
            break
If your dataframe has no duplicate values based on bk1_lvl0_id and bk2_lvl0_id, you can write the function as follows:
def get_safety_stock(df, bk1, bk2):
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that this accesses the first value in the resulting Series (via .iloc[0]), which shouldn't be an issue if there are no duplicates in the data. If you want all of the matches, just remove the .iloc[0] from the end and it will give you the whole Series. The function can be called as follows:
get_safety_stock(df, 1000,3)
>>>16
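If you need to do many of these lookups, another option (just a sketch, and it assumes each (bk1_lvl0_id, bk2_lvl0_id) pair occurs only once) is to set those two columns as the index once, so that each lookup becomes a plain index access:
# Build the lookup Series once; repeated lookups are then direct index accesses.
lookup = df.set_index(["bk1_lvl0_id", "bk2_lvl0_id"])["pr_ss"]
lookup.loc[(1000, 3)]   # -> 16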

Sort pandas dataframe column based on substring

I have a pandas dataframe, as shown below:
Timestamp_Start Event_ID Duration
555.54944 Fix_1 0.42248
559.07281 Fix_10 0.01996
559.14642 Fix_11 0
556.03192 Fix_2 0.16113
556.27985 Fix_3 0.24188
556.56097 Fix_4 0.04987
556.65497 Fix_5 0.10748
556.80859 Fix_6 0.75708
557.57983 Fix_7 0.11329
557.75348 Fix_8 0.65643
558.43665 Fix_9 0.27447
555.97925 Sac_1 0.04577
559.09961 Sac_10 0.0404
559.15302 Sac_11 0.00726
556.19916 Sac_2 0.07403
556.52747 Sac_3 0.02789
556.61865 Sac_4 0.02985
556.76849 Sac_5 0.0337
557.57294 Sac_6 0
557.69965 Sac_7 0.04687
558.41632 Sac_8 0.01325
558.71796 Sac_9 0.34552
I want to sort the 'Event_ID' column, so that Fix_1,Fix_2,Fix_3... and Sac_1,Sac_2,Sac_3... appear in order, like below:
Timestamp_Start Event_ID Duration
555.54944 Fix_1 0.42248
556.03192 Fix_2 0.16113
556.27985 Fix_3 0.24188
556.56097 Fix_4 0.04987
556.65497 Fix_5 0.10748
556.80859 Fix_6 0.75708
557.57983 Fix_7 0.11329
557.75348 Fix_8 0.65643
558.43665 Fix_9 0.27447
559.07281 Fix_10 0.01996
559.14642 Fix_11 0
555.97925 Sac_1 0.04577
556.19916 Sac_2 0.07403
556.52747 Sac_3 0.02789
556.61865 Sac_4 0.02985
556.76849 Sac_5 0.0337
557.57294 Sac_6 0
557.69965 Sac_7 0.04687
558.41632 Sac_8 0.01325
558.71796 Sac_9 0.34552
559.09961 Sac_10 0.0404
559.15302 Sac_11 0.00726
Any ideas on how to do that? Thanks for your help.
One way using distutils.version:
import numpy as np
from distutils.version import LooseVersion
f = np.vectorize(LooseVersion)
new_df = df.sort_values("Event_ID", key=f)
print(new_df)
Output:
Timestamp_Start Event_ID Duration
0 555.54944 Fix_1 0.42248
3 556.03192 Fix_2 0.16113
4 556.27985 Fix_3 0.24188
5 556.56097 Fix_4 0.04987
6 556.65497 Fix_5 0.10748
7 556.80859 Fix_6 0.75708
8 557.57983 Fix_7 0.11329
9 557.75348 Fix_8 0.65643
10 558.43665 Fix_9 0.27447
1 559.07281 Fix_10 0.01996
2 559.14642 Fix_11 0.00000
11 555.97925 Sac_1 0.04577
14 556.19916 Sac_2 0.07403
15 556.52747 Sac_3 0.02789
16 556.61865 Sac_4 0.02985
17 556.76849 Sac_5 0.03370
18 557.57294 Sac_6 0.00000
19 557.69965 Sac_7 0.04687
20 558.41632 Sac_8 0.01325
21 558.71796 Sac_9 0.34552
12 559.09961 Sac_10 0.04040
13 559.15302 Sac_11 0.00726
Normal sorting on the dataframe will not work, as you need the integer in the string to be treated as an int value.
It can be done with a couple of helper columns, though.
You can make two columns like this,
df['event'] = df.Event_ID.str.rsplit("_").str[0]
df['idx'] = df.Event_ID.str.rsplit("_").str[-1].astype(int)
Now, sort on these two columns,
df.sort_values(['event', 'idx'])
Timestamp_Start Event_ID Duration idx event
0 555.54944 Fix_1 0.42248 1 Fix
3 556.03192 Fix_2 0.16113 2 Fix
4 556.27985 Fix_3 0.24188 3 Fix
5 556.56097 Fix_4 0.04987 4 Fix
6 556.65497 Fix_5 0.10748 5 Fix
7 556.80859 Fix_6 0.75708 6 Fix
8 557.57983 Fix_7 0.11329 7 Fix
9 557.75348 Fix_8 0.65643 8 Fix
10 558.43665 Fix_9 0.27447 9 Fix
1 559.07281 Fix_10 0.01996 10 Fix
2 559.14642 Fix_11 0.00000 11 Fix
11 555.97925 Sac_1 0.04577 1 Sac
14 556.19916 Sac_2 0.07403 2 Sac
15 556.52747 Sac_3 0.02789 3 Sac
16 556.61865 Sac_4 0.02985 4 Sac
17 556.76849 Sac_5 0.03370 5 Sac
18 557.57294 Sac_6 0.00000 6 Sac
19 557.69965 Sac_7 0.04687 7 Sac
20 558.41632 Sac_8 0.01325 8 Sac
21 558.71796 Sac_9 0.34552 9 Sac
12 559.09961 Sac_10 0.04040 10 Sac
13 559.15302 Sac_11 0.00726 11 Sac
You can reset_index and drop the extra columns as needed.
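For example, the whole cleanup could be chained like this (a minimal sketch):
sorted_df = (df.sort_values(['event', 'idx'])
               .drop(columns=['event', 'idx'])   # remove the helper columns again
               .reset_index(drop=True))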

Scale values of a particular column of python dataframe between 1-10

I have a dataframe which contains YouTube video views, and I want to scale these values to the range of 1-10.
Below is a sample of what the values look like. How do I normalize them to the range of 1-10, or is there a more efficient way to do this?
rating
4394029
274358
473691
282858
703750
255967
3298456
136643
796896
2932
220661
48688
4661584
2526119
332176
7189818
322896
188162
157437
1153128
788310
1307902
One possibility is scaling by the maximum:
1 + df / df.max() * 9
rating
0 6.500315
1 1.343433
2 1.592952
3 1.354073
4 1.880933
5 1.320412
6 5.128909
7 1.171046
8 1.997531
9 1.003670
10 1.276217
11 1.060946
12 6.835232
13 4.162121
14 1.415808
15 10.000000
16 1.404192
17 1.235536
18 1.197075
19 2.443451
20 1.986783
21 2.637193
A similar solution by Wen (now deleted) uses min-max scaling:
1 + (df - df.min()) * 9 / (df.max() - df.min())
rating
0 6.498887
1 1.339902
2 1.589522
3 1.350546
4 1.877621
5 1.316871
6 5.126922
7 1.167444
8 1.994266
9 1.000000
10 1.272658
11 1.057299
12 6.833941
13 4.159739
14 1.412306
15 10.000000
16 1.400685
17 1.231960
18 1.193484
19 2.440368
20 1.983514
21 2.634189
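Either expression can be assigned straight back to the frame if you want to keep the raw views next to the scaled values, for example (a sketch, with rating_scaled as a made-up column name):
# min-max scale the 'rating' column into [1, 10] and store it alongside the original
df['rating_scaled'] = 1 + (df['rating'] - df['rating'].min()) * 9 / (df['rating'].max() - df['rating'].min())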

Filtering pandas dataframe for a steady speed condition

Below is a sample dataframe which is similar to mine except the one I am working on has 200,000 data points.
import pandas as pd
import numpy as np
df=pd.DataFrame([
[10.07,5], [10.24,5], [12.85,5], [11.85,5],
[11.10,5], [14.56,5], [14.43,5], [14.85,5],
[14.95,5], [10.41,5], [15.20,5], [15.47,5],
[15.40,5], [15.31,5], [15.43,5], [15.65,5]
], columns=['speed','delta_t'])
df
speed delta_t
0 10.07 5
1 10.24 5
2 12.85 5
3 11.85 5
4 11.10 5
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
9 10.41 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
std_dev = df.iloc[0:3,0].std() # this will give 1.55
print(std_dev)
I have 2 columns, 'Speed' and 'Delta_T'. Delta_T is the difference in time between subsequent rows in my actual data (which has date and time). The operating speed keeps varying, and what I want to achieve is to keep only the data points where the speed is nearly steady, say by requiring a standard deviation of < 0.5 over a Delta_T of >= 15 min. For example, if we start with the first speed, the code should keep jumping to the next speeds, keep calculating the standard deviation, and if it is less than 0.5 and the Delta_T values sum up to 30 min or more, I should copy that data into a new dataframe.
So for this dataframe I will be left with index 5 to 8 and 10 to 15.
Is this possible? Could you please give me some suggestions on how to do it? Sorry, I am stuck; it seems too complicated to me.
Thank you.
Best regards, Arun
Let's use rolling, shift and std:
Calculate the rolling std for a window of 3, then find those stds that are less than 0.5 and use shift(-2) to also pick up the rows at the start of each window whose std was less than 0.5. Using boolean indexing with | (or), we can get the entire steady-state range.
df_std = df['speed'].rolling(3).std()
df_ss = df[(df_std < 0.5) | (df_std < 0.5).shift(-2)]
df_ss
Output:
speed delta_t
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
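This covers the standard-deviation part; the question also wants each steady stretch to last a minimum amount of time. One way to layer that on top (a sketch, assuming delta_t is in minutes and using a 15-minute threshold so that both expected ranges survive) is to group the rows of df_ss into consecutive runs and keep only runs whose delta_t adds up to enough:
# A jump in the index of df_ss starts a new steady run
runs = df_ss.index.to_series().diff().ne(1).cumsum()

# Keep only runs whose accumulated delta_t meets the required duration
min_minutes = 15
df_steady = df_ss[df_ss.groupby(runs)['delta_t'].transform('sum') >= min_minutes]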

Pandas appending Series to DataFrame to write to a file

I have a list of DataFrames that I want to compute the mean of:
~ pieces[1].head()
Sample Label C_RUNTIMEN N_TQ N_TR ... N_GEAR1 N_GEAR2 N_GEAR3 \
301 manual 82.150833 7 69 ... 3.615 1.952 1.241
302 manual 82.150833 7 69 ... 3.615 1.952 1.241
303 manual 82.150833 7 69 ... 3.615 1.952 1.241
304 manual 82.150833 7 69 ... 3.615 1.952 1.241
305 manual 82.150833 7 69 ... 3.615 1.952 1.241
So I am looping through them:
pieces = np.array_split(df,size)
output = pd.DataFrame()
for piece in pieces:
    dp = piece.mean()
    output = output.append(dp,ignore_index=True)
Unfortunately the output is sorted (the column names are alphabetical in the output) and I want to keep the original column order (as seen up top).
~ output.head()
C_ABSHUM C_ACCFUELGALN C_AFR C_AFRO C_FRAIRWS C_GEARRATIO \
0 44.578937 66.183858 14.466816 14.113321 18.831117 6.677792
1 34.042593 66.231229 14.320409 14.113321 22.368983 6.677792
2 34.497194 66.309320 14.210066 14.113321 25.353414 6.677792
3 43.430931 66.376632 14.314854 14.113321 28.462130 6.677792
4 44.419204 66.516515 14.314653 14.113321 32.244107 6.677792
I have tried variations of concat etc. with no success. Is there a different way to think about this?
My recommendation would be to concatenate the list of dataframes using pd.concat. This will allow you to use the standard group-by/apply. In this example, multi_df is a data frame with a MultiIndex, which behaves like a standard data frame; only the indexing and group-by are a little different:
x = []
for i in range(10):
    x.append(pd.DataFrame(dict(zip(list('abc'), [i + 1, i + 2, i + 3])), index = list('ind')))
Now x contains a list of data frames of the shape
a b c
i 2 3 4
n 2 3 4
d 2 3 4
And with
multi_df = pd.concat(x, keys = range(len(x)))
result = multi_df.groupby(level = [0]).apply(np.mean)
we get a data frame that looks like
a b c
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9 10 11 12
You can then just call result.to_csv('filepath') to write that out.
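Mapped back onto the original pieces list, the same idea might look like this (a sketch; numeric_only=True assumes the non-numeric Label column should simply be dropped from the means, and the csv path is just a placeholder):
multi_df = pd.concat(pieces, keys=range(len(pieces)))
output = multi_df.groupby(level=0).mean(numeric_only=True)  # original column order is preserved
output.to_csv('piece_means.csv')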
