Python pandas DataFrame equivalent function logic for nonzero value calculation

This is PineScript code that I am trying to write in Python. What would an optimized, equivalent Python implementation be?
kama[1] here is the previous kama value; for the first calculation on an array, what should be done about this kama[1] value, since it will not exist the first time?
kama=nz(kama[1], close[1])+smooth*(close[1]-nz(kama[1], close[1]))
PineScript info:
nz
Replaces NaN values with zeros (or given value) in a series.
nz(x, y) → integer
nz(sma(close, 100))
RETURNS
Two args version: returns x if it's a valid (not NaN) number, otherwise y
One arg version: returns x if it's a valid (not NaN) number, otherwise 0
ARGUMENTS
x (series) Series of values to process.
y (float) Value that will be inserted instead of all NaN values in x series.
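For orientation, nz maps naturally onto a NaN fill in pandas. A minimal sketch of an equivalent helper (the name nz here is only for illustration):
import numpy as np
import pandas as pd

def nz(x, y=0):
    # returns x where it is a valid (not NaN) number, otherwise y
    if isinstance(x, pd.Series):
        return x.fillna(y)
    return y if np.isnan(x) else x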
Edit 1
Something I tried, as below, that is not working:
stockdata['kama'] = stockdata['kama'](-1) if stockdata['kama'](-1) !=0 \
else stockdata['close'] + stockdata['smooth']*(stockdata['close'] - \
stockdata['kama'](-1) if stockdata['kama'](-1) !=0 else stockdata['close'])
Edit 2
The alternative I tried just to make sure at least one part is working (the nz(kama[1], close) part), but that is also failing:
stockdata['kama'] = np.where(stockdata['kama'][-1] != 0, stockdata['kama'][-1], stockdata['close'])
I am completely stuck now. If this line of
kama=nz(kama[1], close)+smooth*(close-nz(kama[1], close))
PineScript code is not converted to Python, my whole logic will go for a toss. Any working solutions are greatly appreciated.
Edit 3:
The DataFrame input of the series:
open high low close adjusted_close \
date
2002-07-01 5.2397 5.5409 5.2397 5.4127 0.0634
2002-07-02 5.5234 5.5370 5.4214 5.4438 0.0638
2002-07-03 5.5060 5.5458 5.3281 5.4661 0.0640
2002-07-04 5.5011 5.5720 5.4175 5.5283 0.0647
2002-07-05 5.5633 5.6566 5.4749 5.5905 0.0655
2002-07-08 5.5011 5.7187 5.5011 5.6255 0.0659
2002-07-09 5.5905 5.7586 5.5681 5.6167 0.0658
2002-07-10 5.4885 5.4885 5.1465 5.2222 0.0612
2002-07-11 4.9784 5.2135 4.9784 5.1863 0.0607
2002-07-12 5.5011 5.5011 5.2446 5.3194 0.0623
2002-07-15 5.3243 5.4797 5.1912 5.3330 0.0625
2002-07-16 5.1999 5.4389 5.1999 5.3155 0.0623
2002-07-17 4.7024 5.1377 4.6189 5.0445 0.0591
2002-07-18 4.8803 5.1465 4.8356 5.0804 0.0595
2002-07-19 5.0270 5.2038 5.0221 5.1513 0.0603
2002-07-22 5.0804 5.1465 4.9687 4.9735 0.0582
2002-07-23 4.8181 5.0843 4.8181 5.0619 0.0593
2002-07-24 5.0580 5.1290 4.9376 5.0619 0.0593
2002-07-25 5.0580 5.0580 4.7918 4.8492 0.0568
volume dividend_amount split_coefficient Om \
date
2002-07-01 21923 0.0 1.0 NaN
2002-07-02 61045 0.0 1.0 NaN
2002-07-03 34161 0.0 1.0 NaN
2002-07-04 27893 0.0 1.0 NaN
2002-07-05 58976 0.0 1.0 NaN
2002-07-08 48910 0.0 1.0 5.472433
2002-07-09 321846 0.0 1.0 5.530900
2002-07-10 138434 0.0 1.0 5.525083
2002-07-11 15027 0.0 1.0 5.437150
2002-07-12 24187 0.0 1.0 5.437150
2002-07-15 50330 0.0 1.0 5.397317
2002-07-16 24928 0.0 1.0 5.347117
2002-07-17 21357 0.0 1.0 5.199100
2002-07-18 27532 0.0 1.0 5.097733
2002-07-19 13380 0.0 1.0 5.105833
2002-07-22 21666 0.0 1.0 5.035717
2002-07-23 40161 0.0 1.0 4.951350
2002-07-24 34480 0.0 1.0 4.927700
2002-07-25 38185 0.0 1.0 4.986967
Hm Lm Cm vClose diff \
date
2002-07-01 NaN NaN NaN NaN 1669.8373
2002-07-02 NaN NaN NaN NaN 1669.8062
2002-07-03 NaN NaN NaN NaN 1669.7839
2002-07-04 NaN NaN NaN NaN 1669.7217
2002-07-05 NaN NaN NaN NaN 1669.6595
2002-07-08 5.595167 5.397117 5.511150 5.493967 1669.6245
2002-07-09 5.631450 5.451850 5.545150 5.539837 1669.6333
2002-07-10 5.623367 5.406033 5.508217 5.515675 1670.0278
2002-07-11 5.567983 5.347750 5.461583 5.453617 1670.0637
2002-07-12 5.556167 5.318933 5.426767 5.434754 1669.9306
2002-07-15 5.526683 5.271650 5.383850 5.394875 1669.9170
2002-07-16 5.480050 5.221450 5.332183 5.345200 1669.9345
2002-07-17 5.376567 5.063250 5.236817 5.218933 1670.2055
2002-07-18 5.319567 5.011433 5.213183 5.160479 1670.1696
2002-07-19 5.317950 5.018717 5.207350 5.162463 1670.0987
2002-07-22 5.258850 4.972733 5.149700 5.104250 1670.2765
2002-07-23 5.192950 4.910550 5.104517 5.039842 1670.1881
2002-07-24 5.141300 4.866833 5.062250 4.999521 1670.1881
2002-07-25 5.128017 4.895650 5.029700 5.010083 1670.4008
signal noise efratio smooth
date
2002-07-01 5.4127 1670.3373 0.003240 0.416113
2002-07-02 5.4438 1670.3062 0.003259 0.416113
2002-07-03 5.4661 1670.2839 0.003273 0.416114
2002-07-04 5.5283 1670.2217 0.003310 0.416115
2002-07-05 5.5905 1670.1595 0.003347 0.416116
2002-07-08 5.6255 1670.1245 0.003368 0.416116
2002-07-09 5.6167 1670.1333 0.003363 0.416116
2002-07-10 5.2222 1670.5278 0.003126 0.416110
2002-07-11 5.1863 1670.5637 0.003105 0.416109
2002-07-12 5.3194 1670.4306 0.003184 0.416111
2002-07-15 5.3330 1670.4170 0.003193 0.416111
2002-07-16 5.3155 1670.4345 0.003182 0.416111
2002-07-17 5.0445 1670.7055 0.003019 0.416107
2002-07-18 5.0804 1670.6696 0.003041 0.416107
2002-07-19 5.1513 1670.5987 0.003084 0.416109
2002-07-22 4.9735 1670.7765 0.002977 0.416106
2002-07-23 5.0619 1670.6881 0.003030 0.416107
2002-07-24 5.0619 1670.6881 0.003030 0.416107
2002-07-25 4.8492 1670.9008 0.002902 0.416104
What is expected for kama = nz(kama[1], close) + smooth*(close - nz(kama[1], close))?
stockdata['kama'] =nz(stockdata[kama][-1],stockdata['close'] +stockdata['smooth']*(stockdata['close']-nz(stockdata['kama'][-1],stockdata['close'])
In this case, for the first iteration there will not be any previous kama value, which needs to be taken care of. All the inputs are given in DataFrame format above.

You need to create the column kama first with the values of close:
import numpy as np

# seed the kama column with close so the shifted series has values to fall back on
stockdata['kama'] = stockdata['close']
previous_kama = stockdata['kama'].shift()
previous_close = stockdata['close'].shift()
# nz(kama[1], close[1]): use the previous kama where it exists, else the previous close
value = np.where(previous_kama.notnull(), previous_kama, previous_close)
stockdata['kama'] = value + stockdata['smooth'] * (previous_close - value)
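Note that kama in the PineScript formula is recursive: each value depends on the previously computed kama, so a single vectorized shift cannot carry the running value forward. A minimal loop-based sketch of the recursion as written in kama = nz(kama[1], close) + smooth*(close - nz(kama[1], close)), assuming stockdata already has the close and smooth columns shown above:
import numpy as np

close = stockdata['close'].to_numpy()
smooth = stockdata['smooth'].to_numpy()
kama = np.full(len(close), np.nan)
for i in range(len(close)):
    # nz(kama[1], close): fall back to the current close until a previous kama exists
    prev = kama[i - 1] if i > 0 and not np.isnan(kama[i - 1]) else close[i]
    kama[i] = prev + smooth[i] * (close[i] - prev)
stockdata['kama'] = kama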

Related

How to count values over different thresholds per column in Pandas dataframe groupby?

My goal is, for a dataset similar to the example below, to group by [s_num, ip, f_num, direction], then filter the score columns using separate thresholds and count how many values are above each threshold.
id s_num ip f_num direction algo_1_x algo_2_x algo_1_score algo_2_score
0 0.0 0.0 0.0 0.0 X -4.63 -4.45 0.624356 0.664009
15 19.0 0.0 2.0 0.0 X -5.44 -5.02 0.411217 0.515843
16 20.0 0.0 2.0 0.0 X -12.36 -5.09 0.397237 0.541112
20 24.0 0.0 2.0 1.0 X -4.94 -5.15 0.401744 0.526032
21 25.0 0.0 2.0 1.0 X -4.78 -4.98 0.386410 0.564934
22 26.0 0.0 2.0 1.0 X -4.89 -5.03 0.394326 0.513896
24 28.0 0.0 2.0 2.0 X -4.78 -5.00 0.420078 0.521993
25 29.0 0.0 2.0 2.0 X -4.91 -5.14 0.407355 0.485878
26 30.0 0.0 2.0 2.0 X 11.83 -4.97 0.392242 0.659122
27 31.0 0.0 2.0 2.0 X -4.73 -5.07 0.377011 0.524774
The result should look something like:
Each entry in an algo_i column is the number of values in the group larger than the corresponding threshold.
So far I tried first grouping, and applying custom aggregation like so:
def count_success(x, thresh):
    return ((x > thresh) * 1).sum()

thresholds = [0.1, 0.2]
df.groupby(attr_cols).agg({f'algo_{i+1}_score': count_success(thresh) for i, thresh in enumerate(thresholds)})
but this results in an error:
count_success() missing 1 required positional argument: 'thresh'
So, how can I pass another argument to a function using .agg( )? or is there an easier way to do it using some pandas function?
Named aggregation does not allow extra parameters to be passed to your function. You can use numpy broadcasting:
import numpy as np
import pandas as pd

attr_cols = ["s_num", "ip", "f_num", "direction"]
score_cols = df.columns[df.columns.str.match(r"algo_\d+_score")]
# Convert everything to numpy to prepare for broadcasting
score = df[score_cols].to_numpy()
threshold = np.array([0.1, 0.5])
# Raise `threshold` up 2 dimensions so that every value in `score` is
# broadcast against every value in `threshold`
mask = score > threshold[:, None, None]
# Assemble the result
row_index = pd.MultiIndex.from_frame(df[attr_cols])
col_index = pd.MultiIndex.from_product([threshold, score_cols], names=["threshold", "algo"])
result = (
    pd.DataFrame(np.hstack(mask), index=row_index, columns=col_index)
    .groupby(attr_cols)
    .sum()
)
Result:
threshold 0.1 0.5
algo algo_1_score algo_2_score algo_1_score algo_2_score
s_num ip f_num direction
0.0 0.0 0.0 X 1 1 1 1
2.0 0.0 X 2 2 0 2
1.0 X 3 3 0 3
2.0 X 4 4 0 3
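As an aside, the original .agg approach can also be made to work by binding the threshold to the function up front, e.g. with functools.partial (a sketch, assuming the df, attr_cols, thresholds and count_success from above); note this applies one threshold per score column rather than the full threshold-by-column grid shown in the result:
from functools import partial

df.groupby(attr_cols).agg(
    {f'algo_{i+1}_score': partial(count_success, thresh=thresh)
     for i, thresh in enumerate(thresholds)}
)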

pandas crosstab simplified view of multiple columns

I have already referred to the posts here and here. Please don't mark this as a duplicate.
I have a dataframe like as below
id,status,country,amount,qty
1,pass,USA,123,4500
1,pass,USA,156,3210
1,fail,UK,687,2137
1,fail,UK,456,1236
2,pass,AUS,216,324
2,pass,AUS,678,241
2,nan,ANZ,637,213
2,pass,ANZ,213,543
sf = pd.read_clipboard(sep=',')
I would like to get the percentage of values from each column as a separate column.
So, with the help of this post, I tried the below
Approach 1 - doesn't give the expected output shape
(pd.crosstab(sf['id'],[sf['status'].fillna('nan'),sf['country'].fillna('nan')],normalize=0)
.drop('nan', 1)
.mul(100)).reset_index()
Approach 2 - doesn't give the expected output
sf_inv= sf.melt()
pd.crosstab(sf_inv.value, sf_inv.variable)
I expect my output to be as below.
You can use crosstab with normalize='index' on your different columns and concat the results:
pd.concat([pd.crosstab(sf['id'], sf[c], normalize='index')
           for c in ['status', 'country']], axis=1).mul(100).add_suffix('_pct')
output:
fail_pct pass_pct ANZ_pct AUS_pct UK_pct USA_pct
id
1 50.0 50.0 0.0 0.0 50.0 50.0
2 0.0 100.0 50.0 50.0 0.0 0.0
Handling NaNs:
pd.concat([pd.crosstab(sf['id'], sf[c].fillna('NA'), normalize='index')
           .drop(columns='NA', errors='ignore')
           for c in ['status', 'country']], axis=1).mul(100).add_suffix('_pct')
output:
fail_pct pass_pct ANZ_pct AUS_pct UK_pct USA_pct
id
1 50.0 50.0 0.0 0.0 50.0 50.0
2 0.0 75.0 50.0 50.0 0.0 0.0

Multi-index calculation to new columns

I have a dataframe like this.
status new allocation
asset csh fi eq csh fi eq
person act_type
p1 inv 0.0 0.0 100000.0 0.0 0.0 1.0
rsp 0.0 30000.0 20000.0 0.0 0.6 0.4
tfsa 10000.0 40000.0 0.0 0.2 0.8 0.0
The right three columns are percent of total for each act_type. The following does calculate the columns correctly:
# set the percent allocations
df.loc[idx[:,:],idx["allocation",'csh']] = df.loc[idx[:,:],idx["new",'csh']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
df.loc[idx[:,:],idx["allocation",'fi']] = df.loc[idx[:,:],idx["new",'fi']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
df.loc[idx[:,:],idx["allocation",'eq']] = df.loc[idx[:,:],idx["new",'eq']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
I have tried to do these calculations on one line combining 'csh', 'fi', 'eq' as follows:
df.loc[idx[:,:],idx["new", ('csh', 'fi', 'eq')]] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
But this results in ValueError: cannot join with no level specified and no overlapping names
Any suggestions on how I can reduce these three lines to one line of code, so that I'm dividing ('csh', 'fi', 'eq') by the account total and getting percentages in the next columns?
First, idx[:,:] can be simplified to :, then use DataFrame.div with axis=0, and for the new columns use rename with DataFrame.join:
df1=df.loc[:, idx["new",('csh', 'fi', 'eq')]].div(df.loc[:, idx["new",:]].sum(axis=1),axis=0)
df = df.join(df1.rename(columns={'new':'allocation'}, level=0))
print (df)
status new allocation
asset csh fi eq csh fi eq
person act_type
p1 inv 0.0 0.0 100000.0 0.0 0.0 1.0
rsp 0.0 30000.0 20000.0 0.0 0.6 0.4
tfsa 10000.0 40000.0 0.0 0.2 0.8 0.0

pandas adding column values in a loop without using iloc

I want to add up (ideally getting the mean of the sum of) several column values, starting at my index i:
investmentlength = list(range(1, 13, 1))
returns = list()
for i in range(0, len(stocks2)):
    if stocks2['Startpoint'][i] == 1:
        nextmonth = nextmonth + stocks2['RET'][i+1] + stocks2['RET'][i+2] + stocks2['RET'][i+3] + ....
        counter += 1
Is there a way to give the beginning index, the end index, and probably a step size, and then sum it all in one command instead of copying and pasting to death? I want to go through all the different investment lengths and put the average returns in the empty list.
SHRCD EXCHCD SICCD PRC VOL RET SHROUT \
DATE PERMNO
1970-08-31 10559.0 10.0 1.0 5311.0 35.000 1692.0 0.030657 12048.0
12626.0 10.0 1.0 5411.0 46.250 926.0 0.088235 6624.0
12749.0 11.0 1.0 5331.0 45.500 5632.0 0.126173 34685.0
13100.0 11.0 1.0 5311.0 22.000 1759.0 0.171242 15107.0
13653.0 10.0 1.0 5311.0 13.125 141.0 0.220930 1337.0
13936.0 11.0 1.0 2331.0 11.500 270.0 -0.053061 3942.0
14322.0 11.0 1.0 5311.0 64.750 6934.0 0.024409 154187.0
16969.0 10.0 1.0 5311.0 42.875 1069.0 0.186851 13828.0
17072.0 10.0 1.0 5311.0 14.750 777.0 0.026087 5415.0
17304.0 10.0 1.0 5311.0 24.875 1939.0 0.058511 8150.0
MV XRET IB ... PE2 \
DATE PERMNO ...
1970-08-31 10559.0 421680.000 0.025357 NaN ... 13.852692
12626.0 306360.000 0.082935 NaN ... 13.145312
12749.0 1578167.500 0.120873 NaN ... 25.970466
13100.0 332354.000 0.165942 NaN ... 9.990711
13653.0 17548.125 0.215630 NaN ... 6.273570
13936.0 45333.000 -0.058361 NaN ... 6.473123
14322.0 9983608.250 0.019109 NaN ... 22.204047
16969.0 592875.500 0.181551 NaN ... 11.948061
17072.0 79871.250 0.020787 NaN ... 8.845526
17304.0 202731.250 0.053211 NaN ... 8.641655
lagPE1 lagPE2 lagMV lagSEQ QUINTILE1 \
DATE PERMNO
1970-08-31 10559.0 13.852692 13.852692 412644.000 264.686 4
12626.0 13.145312 13.145312 281520.000 164.151 4
12749.0 25.970466 25.970466 1404742.500 367.519 5
13100.0 9.990711 9.990711 288921.375 414.820 3
13653.0 6.273570 6.273570 14372.750 24.958 1
13936.0 6.473123 6.473123 48289.500 76.986 1
14322.0 22.204047 22.204047 9790874.500 3439.802 5
16969.0 11.948061 11.948061 499536.500 NaN 4
17072.0 8.845526 8.845526 77840.625 NaN 3
17304.0 8.641655 8.641655 191525.000 307.721 3
QUINTILE2 avgvol avg Startpoint
DATE PERMNO
1970-08-31 10559.0 4 9229.057592 1697.2 0
12626.0 4 3654.367470 894.4 0
12749.0 5 188206.566860 5828.6 0
13100.0 3 94127.319048 3477.2 0
13653.0 1 816.393162 268.8 0
13936.0 1 71547.050633 553.2 0
14322.0 5 195702.521519 6308.8 0
16969.0 4 3670.297872 2002.0 0
17072.0 3 3774.083333 3867.8 0
17304.0 3 12622.112903 1679.4 0
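As a rough sketch of the direction this could take, the repeated additions can be replaced by a positional slice-and-sum with iloc; the window length below is an assumption, and stocks2 is taken as printed above:
returns = list()
length = 12  # assumed investment length: sum the next 12 monthly returns
for i in range(0, len(stocks2)):
    if stocks2['Startpoint'].iloc[i] == 1:
        # sum RET over rows i+1 .. i+length instead of writing out each term
        returns.append(stocks2['RET'].iloc[i + 1:i + 1 + length].sum())
The same slice can be wrapped in another loop over investmentlength to collect the average return for each window size.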

How to exactly compute percentage change with NaN in DataFrame for each day?

I would like to compute a daily percentage change for this DataFrame (frame_):
import pandas as pd
import numpy as np
data_ = {
'A':[1,np.NaN,2,1,1,2],
'B':[1,2,3,1,np.NaN,1],
'C':[1,2,np.NaN,1,1,2],
}
dates_ = [
'06/01/2018','05/01/2018','04/01/2018','03/01/2018','02/01/2018', '01/01/2018'
]
frame_ = pd.DataFrame(data_, index=dates_, columns=['A','B','C'])
The issue is that I get a DataFrame with this method:
returns_ = frame_.pct_change(periods=1, fill_method='pad')
dates,A,B,C
06/01/2018,,,
05/01/2018,,1.0,1.0
04/01/2018,1.0,0.5,
03/01/2018,-0.5,-0.6666666666666667,-0.5
02/01/2018,0.0,,0.0
01/01/2018,1.0,0.0,1.0
This is not what I am looking for, and the dropna() method also doesn't give me the result I seek. I would like to compute a value for each day that has a value, and NaN for the days where there is no value (or NaN). For example, on column A, as a percentage change I would like to see:
dates,A
06/01/2018,1
05/01/2018,
04/01/2018,1.0
03/01/2018,-0.5
02/01/2018,0.0
01/01/2018,1.0
Many thanks in advance
This is one way, a bit brute-force.
import pandas as pd
import numpy as np
data_ = {
'A':[1,np.NaN,2,1,1,2],
'B':[1,2,3,1,np.NaN,1],
'C':[1,2,np.NaN,1,1,2],
}
dates_ = [
'06/01/2018','05/01/2018','04/01/2018','03/01/2018','02/01/2018', '01/01/2018'
]
frame_ = pd.DataFrame(data_, index=dates_, columns=['A','B','C'])
frame_ = pd.concat([frame_, pd.DataFrame(columns=['dA', 'dB', 'dC'])])
for col in ['A', 'B', 'C']:
    frame_['d' + col] = frame_[col].pct_change()
    frame_.loc[pd.notnull(frame_[col]) & pd.isnull(frame_['d' + col]), 'd' + col] = frame_[col]
# A B C dA dB dC
# 06/01/2018 1.0 1.0 1.0 1.0 1.000000 1.0
# 05/01/2018 NaN 2.0 2.0 NaN 1.000000 1.0
# 04/01/2018 2.0 3.0 NaN 1.0 0.500000 NaN
# 03/01/2018 1.0 1.0 1.0 -0.5 -0.666667 -0.5
# 02/01/2018 1.0 NaN 1.0 0.0 NaN 0.0
# 01/01/2018 2.0 1.0 2.0 1.0 0.000000 1.0
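A vectorized variant of the same idea, as a sketch assuming frame_ as originally constructed in the question: take the plain pct_change, then wherever the original value exists but the change is NaN (no usable previous value), fall back to the value itself.
cols = ['A', 'B', 'C']
change = frame_[cols].pct_change()
# where a value exists but its change is NaN, use the value itself
mask = frame_[cols].notna() & change.isna()
change = change.where(~mask, frame_[cols])
frame_ = frame_.join(change.add_prefix('d'))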
