Python Pandas Groupby Alternative (Time Series Analysis) - python

Hi all, I'm using pandas to run several time series analyses. Here is a sample df:
import numpy as np
import pandas as pd
import talib
data = {'ticker': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'],
        'high': [10.2,10.5,11,12,15,16,10.2,5,6,6.2,5.3,5.6,7.8,6],
        'low': [10,10.4,10.5,11,14,15,10,4.8,5.5,6,5,5,7.5,5.8]}
df = pd.DataFrame(data)
I want to compute the 5-day beta for these two stocks using the BETA function in TA-Lib and groupby in pandas. Here is my code:
def beta(df):
    df['beta'] = talib.BETA(df.high, df.low, timeperiod=5)
    df['beta_above_1'] = np.where(df.beta > 1, True, False)
    return df
df = df.groupby(['ticker']).apply(beta)
It works and returns this result:
ticker high low beta beta_above_1
0 A 10.2 10.0 NaN False
1 A 10.5 10.4 NaN False
2 A 11.0 10.5 NaN False
3 A 12.0 11.0 NaN False
4 A 15.0 14.0 NaN False
5 A 16.0 15.0 1.151536 True
6 A 10.2 10.0 0.952395 False
7 B 5.0 4.8 NaN False
8 B 6.0 5.5 NaN False
9 B 6.2 6.0 NaN False
10 B 5.3 5.0 NaN False
11 B 5.6 5.0 NaN False
12 B 7.8 7.5 1.182857 True
13 B 6.0 5.8 1.177912 True
However, this approach takes a long time when I apply it to a DataFrame with over a million rows. I've looked into vectorization to speed up the calculation, but I couldn't work out how to apply it here.
Is there a faster way to run the same analysis? Thanks a lot!
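One possible direction, offered as a sketch rather than a drop-in replacement: for this sample, the TA-Lib output matches a rolling regression slope of the one-period returns of low on the one-period returns of high, so the same numbers can be rebuilt from grouped rolling sums without calling a Python function per ticker. The helper name rolling_beta and the intermediate column names x, y, xy, xx are made up for this illustration, and whether it is actually faster on a million rows would need to be measured.

```python
import numpy as np
import pandas as pd

data = {'ticker': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'],
        'high': [10.2,10.5,11,12,15,16,10.2,5,6,6.2,5.3,5.6,7.8,6],
        'low': [10,10.4,10.5,11,14,15,10,4.8,5.5,6,5,5,7.5,5.8]}
df = pd.DataFrame(data)

def rolling_beta(df, n=5):
    """Hypothetical helper: per-ticker rolling slope of low returns on high returns."""
    g = df.groupby('ticker')
    # one-period returns, computed within each ticker (first row of a ticker is NaN)
    x = df['high'] / g['high'].shift() - 1
    y = df['low'] / g['low'].shift() - 1
    parts = pd.DataFrame({'ticker': df['ticker'],
                          'x': x, 'y': y, 'xy': x * y, 'xx': x * x})
    # rolling sums per ticker; drop the ticker level to get the original index back
    s = (parts.groupby('ticker')[['x', 'y', 'xy', 'xx']]
              .rolling(n).sum()
              .reset_index(level=0, drop=True))
    # least-squares slope from the rolling sums
    return (n * s['xy'] - s['x'] * s['y']) / (n * s['xx'] - s['x'] ** 2)

df['beta'] = rolling_beta(df)
df['beta_above_1'] = df['beta'] > 1   # NaN compares as False
print(df)
```

The sum-based slope formula can lose precision when returns are nearly constant within a window, so it is worth comparing the result against talib.BETA on a subset before relying on it.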

Related

maximum sum of consecutive n-days using pandas

I've seen solutions in other languages (e.g. SQL, Fortran, or C++), which mainly use for loops.
I am hoping someone can help me solve this task using pandas instead.
Suppose I have a data frame that looks like this:
date pcp sum_count sumcum
7/13/2013 0.1 3.0 48.7
7/14/2013 48.5
7/15/2013 0.1
7/16/2013
8/1/2013 1.5 1.0 1.5
8/2/2013
8/3/2013
8/4/2013 0.1 2.0 3.6
8/5/2013 3.5
9/22/2013 0.3 3.0 26.3
9/23/2013 14.0
9/24/2013 12.0
9/25/2013
9/26/2013
10/1/2014 0.1 11.0
10/2/2014 96.0 135.5
10/3/2014 2.5
10/4/2014 37.0
10/5/2014 9.5
10/6/2014 26.5
10/7/2014 0.5
10/8/2014 25.5
10/9/2014 2.0
10/10/2014 5.5
10/11/2014 5.5
I was hoping to do the following:
STEP 1: create the sum_count column by counting the consecutive non-zero values in the 'pcp' column.
STEP 2: create the sumcum column as the sum of each run of consecutive 'pcp' values.
STEP 3: create a pivot table that looks like this:
year max_sum_count
2013 48.7
2014 135.5
BUT the max_sum_count is based on the condition sum_count = 3.
I'd appreciate any help. Thank you!
UPDATED QUESTION:
I previously emphasized that only the maximum sum over 3 consecutive pcps should be returned, but I mistakenly gave the wrong data frame and had to edit it. Sorry.
The sumcum of 135.5 comes from 96.0 + 2.5 + 37.0. It is the maximum sum of 3 consecutive pcps within the run whose sum_count is 11.
Thank you
Use:
# filtering + rolling by days
N = 3
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
# flag the NaN rows
m = df['pcp'].isna()
# label each run of consecutive non-NaN rows
df['g'] = m.cumsum()[~m]
# extract years
df['year'] = df.index.year
# drop the NaN rows
df = df[~m].copy()
# keep only runs with at least N rows
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()
# rolling sum over N days within each run
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))
# take the maximum rolling sum per year
# and add any missing years
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print(df1)
year max_sum_count
0 2013 48.7
1 2014 135.5
First, convert date to a real datetime dtype and create a boolean mask that keeps the rows where pcp is not null. Then you can create the groups and compute your variables:
Input data:
>>> df
date pcp
0 7/13/2013 0.1
1 7/14/2013 48.5
2 7/15/2013 0.1
3 7/16/2013 NaN
4 8/1/2013 1.5
5 8/2/2013 NaN
6 8/3/2013 NaN
7 8/4/2013 0.1
8 8/5/2013 3.5
9 9/22/2013 0.3
10 9/23/2013 14.0
11 9/24/2013 12.0
12 9/25/2013 NaN
13 9/26/2013 NaN
14 10/1/2014 0.1
15 10/2/2014 96.0
16 10/3/2014 2.5
17 10/4/2014 37.0
18 10/5/2014 9.5
19 10/6/2014 26.5
20 10/7/2014 0.5
21 10/8/2014 25.5
22 10/9/2014 2.0
23 10/10/2014 5.5
24 10/11/2014 5.5
Code:
df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()
df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))
pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()
Output results:
>>> df
date pcp sum_count sumcum
0 2013-07-13 0.1 3.0 48.7
1 2013-07-14 48.5 NaN NaN
2 2013-07-15 0.1 NaN NaN
3 2013-07-16 NaN NaN NaN
4 2013-08-01 1.5 1.0 1.5
5 2013-08-02 NaN NaN NaN
6 2013-08-03 NaN NaN NaN
7 2013-08-04 0.1 2.0 3.6
8 2013-08-05 3.5 NaN NaN
9 2013-09-22 0.3 3.0 26.3
10 2013-09-23 14.0 NaN NaN
11 2013-09-24 12.0 NaN NaN
12 2013-09-25 NaN NaN NaN
13 2013-09-26 NaN NaN NaN
14 2014-10-01 0.1 11.0 210.6
15 2014-10-02 96.0 NaN NaN
16 2014-10-03 2.5 NaN NaN
17 2014-10-04 37.0 NaN NaN
18 2014-10-05 9.5 NaN NaN
19 2014-10-06 26.5 NaN NaN
20 2014-10-07 0.5 NaN NaN
21 2014-10-08 25.5 NaN NaN
22 2014-10-09 2.0 NaN NaN
23 2014-10-10 5.5 NaN NaN
24 2014-10-11 5.5 NaN NaN
>>> pivot
date max_sum_count
0 2013 48.7
1 2014 210.6
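The run-labelling step in the code above (compare each date with the previous date plus one day, then take a cumulative sum) can be seen in isolation in this minimal sketch. The six dates are a shortened sample taken from the question, and the variable names starts_new_run and run_id are chosen here for illustration.

```python
import pandas as pd

# dates of the non-null rows only (shortened sample)
dates = pd.Series(pd.to_datetime(['2013-07-13', '2013-07-14', '2013-07-15',
                                  '2013-08-01', '2013-08-04', '2013-08-05']))

# True wherever a date does not directly follow the previous one
starts_new_run = dates.ne(dates.shift() + pd.Timedelta(days=1))

# the cumulative sum of the flags gives one integer label per consecutive run
run_id = starts_new_run.cumsum()
print(run_id.tolist())  # [1, 1, 1, 2, 3, 3]
```

The first row always starts a run, because its shifted value is NaT and never compares equal.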

Sorting value by two columns in Pandas Python

The idea is to sort values by two columns.
Given two columns, I expect output like the following.
Expected output
x y
0 2.0 NaN
1 3.0 NaN
2 4.0 4.1
3 NaN 5.0
4 10.0 NaN
5 24.0 24.7
6 31.0 31.4
However, the code below
import pandas as pd
import numpy as np
df1 = pd.DataFrame ( {'x': [2, 3, 4, 24, 31, '',10],
'y':['','',4.1,24.7,31.4,5,'']} )
df1.replace(r'^\s*$', np.nan, regex=True,inplace=True)
rslt_df = df1.sort_values ( by=['x', 'y'], ascending=(True, True) )
print(rslt_df)
produces the following:
x y
0 2.0 NaN
1 3.0 NaN
2 4.0 4.1
6 10.0 NaN
3 24.0 24.7
4 31.0 31.4
5 NaN 5.0
Notice that the last row, with 5.0 in column y, is placed at the bottom.
How should I modify the code to obtain the intended output?
Try sorting by x with its NaNs filled from y, then reindex using the index of those sorted values:
df1.reindex(df1['x'].fillna(df1['y']).sort_values().index).reset_index(drop=True)
To update the df1 variable:
df1 = (
    df1.reindex(df1['x'].fillna(df1['y']).sort_values().index)
       .reset_index(drop=True)
)
df1:
x y
0 2.0 NaN
1 3.0 NaN
2 4.0 4.1
3 NaN 5.0
4 10.0 NaN
5 24.0 24.7
6 31.0 31.4
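The same idea can also be written with the key argument of sort_values, which is available in pandas 1.1 and later. This is only a sketch of an equivalent spelling; the frame is rebuilt here with np.nan directly instead of empty strings so that both columns are numeric.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [2, 3, 4, 24, 31, np.nan, 10],
                    'y': [np.nan, np.nan, 4.1, 24.7, 31.4, 5, np.nan]})

# sort on x, but let rows with a missing x fall back to their y value
sorted_df = df1.sort_values('x', key=lambda col: col.fillna(df1['y']))
print(sorted_df.index.tolist())  # [0, 1, 2, 5, 6, 3, 4]

result = sorted_df.reset_index(drop=True)
print(result)
```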
with np.sort and argsort:
df1.iloc[np.sort(df1[['x','y']],axis=1)[:,0].argsort()]
x y
0 2.0 NaN
1 3.0 NaN
2 4.0 4.1
5 NaN 5.0
6 10.0 NaN
3 24.0 24.7
4 31.0 31.4

Find a value in a column in function of another column

Assuming that the value exists, how can I create another column "testFinal" in the dataframe that holds the absolute value of df["test"] minus the value of df["test"] 0.2 seconds later?
For example, the first value of testFinal is the absolute difference between 2 and the value 0.2 seconds later, which is 8, so the result is abs(2-8) = 6.
My goal is to calculate "testFinal".
I don't know if that is clear, so here is an example.
NB: the Timestamp is not evenly spaced, so the interval between two values can vary over time.
Thanks a lot
Here is the code for the dataframe
df = pd.DataFrame({'Timestamp':[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.10],
                   'test':[2,22,8,4,5,4,5,3,54,23,89],
                   'testFinal':[6,18,3,0,0,1,49,20,35,np.nan,np.nan]})
First, create a temporary column temp by converting the Timestamp column to timedelta with pd.to_timedelta, and set it as the dataframe index. Next, create a column testFinal holding this index + 0.2 seconds, and use Series.map to map it to the values of the df['test'] column, so that testFinal holds the test value from 0.2s later. Finally, subtract the test column from testFinal and take the absolute value to get the desired result:
df['temp'] = pd.to_timedelta(df['Timestamp'], unit='s')
df = df.set_index('temp')
df['testFinal'] = df.index + pd.Timedelta(seconds=0.2)
df['testFinal'] = df['testFinal'].map(df['test']).sub(df['test']).abs()
df = df.reset_index(drop=True)
# print(df)
Timestamp test testFinal
0 11.1 2 6.0
1 11.2 22 18.0
2 11.3 8 3.0
3 11.4 4 0.0
4 11.5 5 0.0
5 11.6 4 1.0
6 11.7 5 49.0
7 11.8 3 20.0
8 11.9 54 35.0
9 12.0 23 NaN
10 12.1 89 NaN
You could use numpy as follows. I created a new column test_final to compare with the expected testFinal column.
import numpy as np
test = df.test.values
df['test_final'] = np.abs(test - np.concatenate((test[2:], np.array([np.nan]*2)), axis=0))
print(df)
Output:
Timestamp test testFinal test_final
0 11.1 2 6.0 6.0
1 11.2 22 18.0 18.0
2 11.3 8 3.0 3.0
3 11.4 4 0.0 0.0
4 11.5 5 0.0 0.0
5 11.6 4 1.0 1.0
6 11.7 5 49.0 49.0
7 11.8 3 20.0 20.0
8 11.9 54 35.0 35.0
9 12.0 23 NaN NaN
10 12.1 89 NaN NaN
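Note that the numpy answer above shifts by two rows, which only works while the timestamps are evenly spaced 0.1 s apart. For uneven spacing, one option is a time-based lookup with pd.merge_asof, sketched below. The integer millisecond columns t_ms and target_ms, the column test_later, and the 1 ms tolerance are assumptions made for this illustration and should be adapted to the real sampling resolution.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6,
                                 11.7, 11.8, 11.9, 12.0, 12.10],
                   'test': [2, 22, 8, 4, 5, 4, 5, 3, 54, 23, 89]})

# integer milliseconds avoid exact comparisons between floats
df['t_ms'] = (df['Timestamp'] * 1000).round().astype('int64')

# each row looks for the sample 200 ms after itself
left = df.assign(target_ms=df['t_ms'] + 200)
lookup = df[['t_ms', 'test']].rename(columns={'t_ms': 'target_ms',
                                              'test': 'test_later'})

# nearest sample within 1 ms of the target; rows without a match get NaN
out = pd.merge_asof(left, lookup, on='target_ms',
                    direction='nearest', tolerance=1)
out['testFinal'] = (out['test_later'] - out['test']).abs()
print(out[['Timestamp', 'test', 'testFinal']])
```

Both inputs to merge_asof must be sorted by the key, which holds here because the timestamps are increasing.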

Conditional pairwise calculations in pandas

For example, I have 2 dfs:
df1
ID,col1,col2
1,5,9
2,6,3
3,7,2
4,8,5
and another df is
df2
ID,col1,col2
1,11,9
2,12,7
3,13,2
I want to first calculate the pairwise subtraction between df2 and df1. I am using cdist from scipy.spatial.distance with a custom function subtract_:
from scipy.spatial.distance import cdist
def subtract_(a, b):
    return abs(a - b)
d1_s = df1[['col1']]
d2_s = df2[['col1']]
dist = cdist(d1_s, d2_s, metric=subtract_)
dist_df = pd.DataFrame(dist, columns= d2_s.values.ravel())
print(dist_df)
11 12 13
6.0 7.0 8.0
5.0 6.0 7.0
4.0 5.0 6.0
3.0 4.0 5.0
Now I want to check the new columns, named 11, 12 and 13, for any values less than 5. Where there are such values, I want to do a further calculation, like this.
For example, in column '11', one value less than 5 is 4, in the third row. In that case I want to take 'col2' of df1 at the third row, which is 2, and subtract from it 'col2' of df2 at the first row, which is 9 (the first row because column '11' came from the value in the first row of df2).
My for loop for this is very complex. It would be great if there were an easier way in pandas.
Any help or suggestions would be great.
The expected new dataframe is this
0,1,2
Nan,Nan,Nan
Nan,Nan,Nan
(2-9)=-7,Nan,Nan
(5-9)=-4,(5-7)=-2,Nan
Similar to Ben's answer, but with np.where:
pd.DataFrame(np.where(dist_df<5, df1.col2.values[:,None] - df2.col2.values, np.nan),
index=dist_df.index,
columns=dist_df.columns)
Output:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
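As a self-contained sketch, the distance matrix itself can also be built with numpy broadcasting instead of cdist with a Python callback, so that both steps stay vectorized. The frames are rebuilt from the question's data; the names dist, diff and result are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'col1': [5, 6, 7, 8], 'col2': [9, 3, 2, 5]})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'col1': [11, 12, 13], 'col2': [9, 7, 2]})

# pairwise |df1.col1 - df2.col1|: a (4, 1) column against a (3,) row gives (4, 3)
dist = np.abs(df1['col1'].to_numpy()[:, None] - df2['col1'].to_numpy())

# pairwise df1.col2 - df2.col2 with the same shape
diff = df1['col2'].to_numpy()[:, None] - df2['col2'].to_numpy()

# keep the col2 difference only where the col1 distance is below 5
result = pd.DataFrame(np.where(dist < 5, diff, np.nan),
                      columns=df2['col1'].to_numpy())
print(result)
```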
In your case, using mask with numpy broadcasting (df here is dist_df):
df.mask(df<5,df-(df1.col2.values[:,None]+df2.col2.values))
Out[115]:
11 12 13
0 6.0 7.0 8.0
1 5.0 6.0 7.0
2 -7.0 5.0 6.0
3 -11.0 -8.0 5.0
Update
Newdf=(df-(-df1.col2.values[:,None]+df2.col2.values)-df).where(df<5)
Out[148]:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN

pandas filling nans by mean of before and after non-nan values

I would like to fill df's nan with an average of adjacent elements.
Consider a dataframe:
df = pd.DataFrame({'val': [1,np.nan, 4, 5, np.nan, 10, 1,2,5, np.nan, np.nan, 9]})
val
0 1.0
1 NaN
2 4.0
3 5.0
4 NaN
5 10.0
6 1.0
7 2.0
8 5.0
9 NaN
10 NaN
11 9.0
My desired output is:
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 7.0 <<< deadend
10 7.0 <<< deadend
11 9.0
I've looked into other solutions such as Fill cell containing NaN with average of value before and after, but this won't work in case of two or more consecutive np.nans.
Any help is greatly appreciated!
Use ffill + bfill and divide by 2:
df = (df.ffill()+df.bfill())/2
print(df)
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 7.0
10 7.0
11 9.0
EDIT: If the first or last element is NaN, then use (Dark's suggestion):
df = pd.DataFrame({'val':[np.nan,1,np.nan, 4, 5, np.nan,
                          10, 1,2,5, np.nan, np.nan, 9,np.nan,]})
df = (df.ffill()+df.bfill())/2
df = df.bfill().ffill()
print(df)
val
0 1.0
1 1.0
2 2.5
3 4.0
4 5.0
5 7.5
6 10.0
7 1.0
8 2.0
9 5.0
10 7.0
11 7.0
12 9.0
13 9.0
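To see why this works, it can help to place the forward-filled and backward-filled values side by side: each NaN receives its previous and its next valid value, and the result is their mean. The sketch below is a variation rather than the answer's exact code; it takes a row-wise mean, which skips NaN by default, so the leading and trailing gaps are filled without the extra bfill().ffill() step. The column names prev_valid and next_valid are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [np.nan, 1, np.nan, 4, 5, np.nan,
                           10, 1, 2, 5, np.nan, np.nan, 9, np.nan]})

# previous and next valid value for every row
neighbours = pd.DataFrame({'prev_valid': df['val'].ffill(),
                           'next_valid': df['val'].bfill()})

# the row-wise mean ignores NaN, so edge rows fall back to their single neighbour
filled = neighbours.mean(axis=1)
print(filled.tolist())
# [1.0, 1.0, 2.5, 4.0, 5.0, 7.5, 10.0, 1.0, 2.0, 5.0, 7.0, 7.0, 9.0, 9.0]
```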
Although with multiple consecutive NaNs it doesn't produce the exact output you specified, other users reaching this page may actually prefer the effect of the interpolate() method:
df = df.interpolate()
print(df)
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 6.3
10 7.7
11 9.0
