I want to do calculations on three columns of a dataframe df. To do that, I want to run a list of asset (cryptocurrency) prices through a three-column table and calculate the exponential moving average of each column once there is enough data.
def calculateAllEMA(self, values_array):
    df = pd.DataFrame(values_array, columns=['BTC', 'ETH', 'DASH'])
    column_by_search = ["BTC", "ETH", "DASH"]
    print(df)
    for i, column in enumerate(column_by_search):
        ema = []
        # over and over for each day that follows day 23 to get the full range of EMA
        for j in range(0, len(column)-24):
            # Add the closing prices for the first 22 days together and divide them by 22.
            EMA_yesterday = column.iloc[1+j:22+j].mean()
            k = float(2)/(22+1)
            # getting the first EMA day by taking the following day's (day 23) closing price
            # multiplied by k, then multiply the previous day's moving average by (1-k) and add the two
            ema.append(column.iloc[23+j]*k + EMA_yesterday*(1-k))
            print("ema")
            print(ema)
        mean_exp[i] = ema[-1]
    return mean_exp
Yet, when I print len(column)-24 I get -21 (that is, 3 - 24?), so I never get into the loop. How can I fix this error and get the exponential moving average of the assets?
I tried to apply the pseudocode for the exponential moving average from this iexplain.com link.
If you have an easier approach, I'm open to hearing it.
Here is the data I use when it bugs:
BTC ETH DASH
0 4044.59 294.40 196.97
1 4045.25 294.31 196.97
2 4044.59 294.40 196.97
3 4045.25 294.31 196.97
4 4044.59 294.40 196.97
5 4045.25 294.31 196.97
6 4044.59 294.40 196.97
7 4045.25 294.31 196.97
8 4045.25 294.31 196.97
9 4044.59 294.40 196.97
10 4045.25 294.31 196.97
11 4044.59 294.40 196.97
12 4045.25 294.31 196.97
13 4045.25 294.32 197.07
14 4045.25 294.31 196.97
15 4045.41 294.46 197.07
16 4045.25 294.41 197.07
17 4045.41 294.41 197.07
18 4045.41 294.47 197.07
19 4045.25 294.41 197.07
20 4045.25 294.32 197.07
21 4045.43 294.35 197.07
22 4045.41 294.46 197.07
23 4045.25 294.41 197.07
pandas.stats.moments.ewma from the original answer has been deprecated.
Instead you can use pandas.DataFrame.ewm as documented here.
Below is a complete snippet with random data that builds a dataframe with calculated ewmas from specified columns.
Code:
# imports
import pandas as pd
import numpy as np
np.random.seed(123)
rows = 50
df = pd.DataFrame(np.random.randint(90,110,size=(rows, 3)), columns=['BTC', 'ETH', 'DASH'])
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
def ewmas(df, win, keepSource):
    """Add exponentially weighted moving averages for all columns in a dataframe.

    Arguments:
    df -- pandas dataframe
    win -- length of ewma estimation window
    keepSource -- True or False for keep or drop source data in output dataframe
    """
    df_temp = df.copy()
    # Manage existing column names
    colNames = list(df_temp.columns.values)
    for col in colNames:
        # Make new names for ewmas
        ewmaName = col + '_ewma_' + str(win)
        # Add ewmas (pd.stats.moments.ewma(df[col], span=win) was the deprecated equivalent)
        df_temp[ewmaName] = df[col].ewm(span=win, adjust=True).mean()
    # Remove estimates with insufficient window length
    df_temp = df_temp.iloc[win:]
    # Remove or keep source data
    if not keepSource:
        df_temp = df_temp.drop(columns=colNames)
    return df_temp
# Test run
df_new = ewmas(df = df, win = 22, keepSource = True)
print(df_new.tail())
Output:
BTC ETH DASH BTC_ewma_22 ETH_ewma_22 DASH_ewma_22
dates
2017-02-15 91 96 98 98.752431 100.081052 97.926787
2017-02-16 100 102 102 98.862445 100.250270 98.285973
2017-02-17 100 107 97 98.962634 100.844749 98.172712
2017-02-18 103 102 91 99.317826 100.946384 97.541684
2017-02-19 99 104 91 99.289894 101.214755 96.966758
Plot using df_new[['BTC', 'BTC_ewma_22']].plot():
In your loop for i,column in enumerate(column_by_search): you iterate over the elements of your column_by_search list, that is, column takes on the values "BTC", "ETH", "DASH" in turn. Thus, len(column) gives you the length of the string "BTC", which is in fact 3, hence len(column)-24 == -21.
Try df[column] instead; that returns the Series holding the values of the desired column, and you can iterate over it.
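A minimal sketch of the corrected loop, keeping the question's own indexing (mean_exp is assumed to be preallocated as in the original; note that with only 24 rows range(0, len(prices)-24) is still empty, so at least 25 rows are needed before ema gets populated):

for i, column in enumerate(column_by_search):
    prices = df[column]  # the Series of prices, not the 3-character column name
    ema = []
    for j in range(0, len(prices) - 24):
        # mean of the previous window of prices (21 values here, per the question's slice)
        EMA_yesterday = prices.iloc[1 + j:22 + j].mean()
        k = 2.0 / (22 + 1)
        ema.append(prices.iloc[23 + j] * k + EMA_yesterday * (1 - k))
    mean_exp[i] = ema[-1]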
I have a TSV that looks as follows:
chr_1 start_1 chr_2 start_2
11 69633786 14 105884873
12 81940993 X 137690551
13 29782093 12 97838049
14 105864244 11 69633799
17 33207000 20 9992701
17 38446991 20 2102271
17 38447482 17 29623333
20 9992701 17 33207000
20 10426599 17 33094167
20 13765533 17 29469669
22 27415959 8 36197094
22 37191634 8 38983042
22 44464751 18 74004141
8 36197054 22 23130534
8 36197054 22 23131537
8 36197054 8 23130539
This will be referred to as transDiffStartEndChr, which is a DataFrame.
I am working on a program that takes this TSV as input and outputs pairs of rows that have the same chr_1 and chr_2, and whose start_1 and start_2 values are within +/- 1000 of each other.
Ideal output would look like:
chr_1 start_1 chr_2 start_2
8 36197054 8 23130539
8 36197054 22 23131537
Potentially creating groups for every hit based on chr_1 and chr_2.
My current script/thoughts:
transDiffStartEndChr = pd.read_csv('test-input.tsv', sep='\t')

# I will extract rows first by chr_1; in this case I'm doing a test case for 17.
rowsStartChr17 = transDiffStartEndChr[transDiffStartEndChr.apply(extractChr, chr='17', axis=1)]

# I figure I can do something stupid and use brute force, but I feel like I'm not tackling this problem correctly
for index, row in rowsStartChr17.iterrows():
    for index2, row2 in rowsStartChr17.iterrows():
        if index == index2:
            continue
        elif row['chr_1'] == row2['chr_1'] and row['chr_2'] == row2['chr_2']:
            if proximityCheck(row['start_1'], row2['start_1']) and proximityCheck(row['start_2'], row2['start_2']):
                print(f'Row: {index} Match: {index2}')
Any thoughts are appreciated.
You can play with numpy and pandas to filter out the groups that don't match your requirements.
>>> df.groupby(['chr_1', 'chr_2'])\
        .filter(lambda s: len(np.array(np.where(
                    np.tril(
                        np.abs(
                            np.subtract.outer(s['start_2'].values,
                                              s['start_2'].values)) < 1500, -1)))
                .flatten()) > 0)
The logic is to group by chr_1 and chr_2 and perform an outer subtraction between start_2 values to check whether there are values below 1500 (the threshold I used).
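For readability, here is a hypothetical equivalent that factors the same check into a named predicate (the has_close_pair helper is my own illustration, not from the original answer; like the one-liner above, it only compares start_2 values, with the same 1500 threshold):

import numpy as np

def has_close_pair(s, threshold=1500):
    # pairwise absolute differences between all start_2 values in the group
    diffs = np.abs(np.subtract.outer(s['start_2'].values, s['start_2'].values))
    # keep only entries below the diagonal so each pair is counted once
    return np.tril(diffs < threshold, k=-1).any()

out = df.groupby(['chr_1', 'chr_2']).filter(has_close_pair)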
I have a dataset as below:
index    10_YR_CAGR  5_YR_CAGR  1_YR_CAGR
c1_rev   20.5        21.5       31.5
c2_rev   20.5        22.5       24
c3_rev   21          24         27
c4_rev   20          26         30
c5_rev   24          19         15
c1_eps   21          22         23
c2_eps   21          24         25
This data has 5 companies and their parameters like rev, eps, profit, etc. I need to plot as below:
rev:
x_axis-> index_col c1_rev, ...c5_rev
y_axis -> 10_YR_CAGR .. 1_YR_CAGR
eps:
x_axis -> index_col: c1_eps,...c5_eps
y_axis -> 10_YR_CAGR,... 1_YR_CAGR
etc...
I have tried the following code:

eps = analysis_df[analysis_df.index.str.contains('eps', regex=True)]
for i1 in eps.columns[eps.columns != 'index']:
    sns.lineplot(x="index", y=i1, data=eps, label=i1)
Right now I have to make a sub-dataframe for each parameter from the source and then loop over it. How can I write a for loop that works from the main source dataframe itself?
Instead of creating a loop per parameter, how can I loop over the main source dataframe to create a grid of plots, with parameters like rev, eps, and profit mapped to FacetGrid facets? How do I apply those filters in FacetGrid?
My sample output of the above code:
How can I produce the same sort of plot for different parameters in a single for loop?
The way facets are typically plotted is by "melting" your analysis_df into id/variable/value columns.
split() the index column into Company and Parameter, which we'll later use as id columns when melting:
analysis_df[['Company', 'Parameter']] = analysis_df['index'].str.split('_', expand=True)
# index 10_YR_CAGR 5_YR_CAGR 1_YR_CAGR Company Parameter
# 0 c1_rev 100 21 1 c1 rev
# 1 c2_rev 1 32 24 c2 rev
# ...
melt() the CAGR columns:
melted = analysis_df.melt(
    id_vars=['Company', 'Parameter'],
    value_vars=['10_YR_CAGR', '5_YR_CAGR', '1_YR_CAGR'],
    var_name='Period',
    value_name='CAGR',
)
# Company Parameter Period CAGR
# 0 c1 rev 10_YR_CAGR 100
# 1 c2 rev 10_YR_CAGR 1
# 2 c3 rev 10_YR_CAGR 14
# 3 c1 eps 10_YR_CAGR 1
# ...
# 25 c2 pft 1_YR_CAGR 14
# 26 c3 pft 1_YR_CAGR 17
relplot() CAGR vs Company (colored by Period) for each Parameter using the melted dataframe:
sns.relplot(
    data=melted,
    kind='line',
    col='Parameter',
    x='Company',
    y='CAGR',
    hue='Period',
    col_wrap=1,
    facet_kws={'sharex': False, 'sharey': False},
)
Sample data to reproduce this plot:
import io
import pandas as pd
csv = '''
index,10_YR_CAGR,5_YR_CAGR,1_YR_CAGR
c1_rev,100,21,1
c2_rev,1,32,24
c3_rev,14,23,7
c1_eps,1,20,50
c2_eps,21,20,25
c3_eps,31,20,37
c1_pft,20,1,10
c2_pft,25,20,14
c3_pft,11,55,17
'''
analysis_df = pd.read_csv(io.StringIO(csv))
I am trying to build an annual EONIA forward curve with inputs of tenors from 1 week to 50 years.
This is what I have managed to code thus far:
data
maturity spot rate
0 1 -0.529
1 2 -0.529
2 3 -0.529
3 1 -0.504
4 2 -0.505
5 3 -0.506
6 4 -0.508
7 5 -0.509
8 6 -0.510
9 7 -0.512
10 8 -0.514
11 9 -0.515
12 10 -0.517
13 11 -0.518
14 1 -0.520
15 15 -0.524
16 18 -0.526
17 21 -0.527
18 2 -0.528
19 3 -0.519
20 4 -0.501
21 5 -0.476
22 6 -0.441
23 7 -0.402
24 8 -0.358
25 9 -0.313
26 10 -0.265
27 11 -0.219
28 12 -0.174
29 15 -0.062
30 20 0.034
31 25 0.054
32 30 0.039
33 40 -0.001
34 50 -0.037
terms= data["maturity"].tolist()
rates= data['spot rate'].tolist()
calendar = ql.TARGET()
business_convention = ql.ModifiedFollowing
day_count = ql.Actual360()
settlement_days_EONIA = 2
EONIA = ql.OvernightIndex("EONIA", settlement_days_EONIA, ql.EURCurrency(), calendar, day_count)
# Deposit Helper
depo_facility = -0.50
depo_helper = [ql.DepositRateHelper(ql.QuoteHandle(ql.SimpleQuote(depo_facility/100)), ql.Period(1,ql.Days), 1, calendar, ql.Unadjusted, False, day_count)]
# OIS Helper
OIS_helpers = []
for i in range(len(terms)):
    if i < 3:
        tenor = ql.Period(ql.Weeks)
        eon_rate = rates[i]
        OIS_helpers.append(ql.OISRateHelper(settlement_days_EONIA, tenor, ql.QuoteHandle(ql.SimpleQuote(eon_rate/100)), EONIA))
    elif i < 12:
        tenor = ql.Period(ql.Months)
        eon_rate = rates[i]
        OIS_helpers.append(ql.OISRateHelper(settlement_days_EONIA, tenor, ql.QuoteHandle(ql.SimpleQuote(eon_rate/100)), EONIA))
    else:
        tenor = ql.Period(ql.Years)
        eon_rate = rates[i]
        OIS_helpers.append(ql.OISRateHelper(settlement_days_EONIA, tenor, ql.QuoteHandle(ql.SimpleQuote(eon_rate/100)), EONIA))
rate_helpers = depo_helper + OIS_helpers
eonia_curve_c = ql.PiecewiseLogCubicDiscount(0, ql.TARGET(), rate_helpers, day_count)
#This doesn't give me a daily grid of rates, but only the rates at the tenors of my input
eonia_curve_c.enableExtrapolation()
days = ql.MakeSchedule(eonia_curve_c.referenceDate(), eonia_curve_c.maxDate(), ql.Period('1Y'))
rates_fwd = [
    eonia_curve_c.forwardRate(d, calendar.advance(d, 365, ql.Days), day_count, ql.Simple).rate()*100
    for d in days
]
The problem is that when I run the code, I get the following error:
RuntimeError: more than one instrument with pillar June 18th, 2021
There is probably an error somewhere in the OIS helper code, where there is an overlap, but I am not sure what I have done wrong. Does anyone know what the problem is?
First off, apologies for any inelegant Python, as I am coming from C++:
The main issue with the original question was that ql.Period() takes two parameters when used with an integer number of periods, e.g. ql.Period(3, ql.Years). If instead you construct the input array with string representations of the tenors, e.g. '3y', you can pass the string straight to ql.Period(). So ql.Period(3, ql.Years) and ql.Period('3y') give the same result.
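As a quick sanity check of that equivalence (my own snippet, assuming the QuantLib-Python bindings are installed):

import QuantLib as ql

# both constructors should describe the same 3-year period
print(ql.Period(3, ql.Years) == ql.Period('3y'))  # expected: True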
import QuantLib as ql
import numpy as np
import pandas as pd
curve = [ ['1w', -0.529],
['2w', -0.529],
['3w', -0.529],
['1m', -0.504],
['2m', -0.505],
['3m', -0.506],
['4m', -0.508],
['5m', -0.509],
['6m', -0.510],
['7m', -0.512],
['8m', -0.514],
['9m', -0.515],
['10m', -0.517],
['11m', -0.518],
['1y', -0.520],
['15m', -0.524],
['18m', -0.526],
['21m', -0.527],
['2y', -0.528],
['3y', -0.519],
['4y', -0.501],
['5y', -0.476],
['6y', -0.441],
['7y', -0.402],
['8y', -0.358],
['9y', -0.313],
['10y', -0.265],
['11y', -0.219],
['12y', -0.174],
['15y', -0.062],
['20y', 0.034],
['25y', 0.054],
['30y', 0.039],
['40y', -0.001],
['50y', -0.037] ]
data = pd.DataFrame(curve, columns = ['maturity','spot rate'])
print('Input curve\n',data)
terms= data["maturity"].tolist()
rates= data['spot rate'].tolist()
calendar = ql.TARGET()
day_count = ql.Actual360()
settlement_days_EONIA = 2
EONIA = ql.OvernightIndex("EONIA", settlement_days_EONIA, ql.EURCurrency(), calendar, day_count)
# Deposit Helper
depo_facility = -0.50
depo_helper = [ql.DepositRateHelper(ql.QuoteHandle(ql.SimpleQuote(depo_facility/100)), ql.Period(1,ql.Days), 1, calendar, ql.Unadjusted, False, day_count)]
# OIS Helper
OIS_helpers = []
for i in range(len(terms)):
    tenor = ql.Period(terms[i])
    eon_rate = rates[i]
    OIS_helpers.append(ql.OISRateHelper(settlement_days_EONIA, tenor, ql.QuoteHandle(ql.SimpleQuote(eon_rate/100)), EONIA))
rate_helpers = depo_helper + OIS_helpers
eonia_curve_c = ql.PiecewiseLogCubicDiscount(0, ql.TARGET(), rate_helpers, day_count)
#This doesn't give me a daily grid of rates, but only the rates at the tenors of my input
eonia_curve_c.enableExtrapolation()
days = ql.MakeSchedule(eonia_curve_c.referenceDate(), eonia_curve_c.maxDate(), ql.Period('1Y'))
rates_fwd = [
    eonia_curve_c.forwardRate(d, calendar.advance(d, 365, ql.Days), day_count, ql.Simple).rate()*100
    for d in days
]
print('Output\n',pd.DataFrame(rates_fwd,columns=['Fwd rate']))
I am attempting to interpolate a value based on a number's position in a different column. Take this column for instance:
Coupon Price
9.5 109.04
9.375 108.79
9.25 108.54
9.125 108.29
9 108.04
8.875 107.79
8.75 107.54
8.625 107.29
8.5 107.04
8.375 106.79
8.25 106.54
Let's say I have a number like 107. I want to find 107's relative distance from both 107.04 and 106.79, then interpolate the value that has the same relative distance between 8.5 and 8.375, the coupon values at the same indexes. Is this possible? I can solve this in Excel using the FORECAST function, but I want to know if it can be done in Python.
Welcome to Stack Overflow.
We need to make a custom function for this, unless there's a standard library function I'm unaware of, which is entirely possible. I'm going to make a function that allows you to enter a bond by price, and it will get inserted into the dataframe with the appropriate coupon.
Assuming we are starting with a sorted dataframe.
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.875 107.79
6 8.750 107.54
7 8.625 107.29
8 8.500 107.04
9 8.375 106.79
10 8.250 106.54
I've inserted comments into the function.
import numpy as np

def add_bond(Price, df):
    # Add row
    df.loc[df.shape[0]] = [np.nan, Price]
    df = df.sort_values('Price', ascending=False).reset_index(drop=True)
    # Get index of the newly inserted row
    idx = df[df['Price'] == Price].head(1).index.tolist()[0]
    # Get the distance in Price from the previous row to the next row
    span = abs(df.iloc[idx-1, 1] - df.iloc[idx+1, 1]).round(4)
    # Get the distance and direction in Price from the previous row to the new value
    terp = (df.iloc[idx, 1] - df.iloc[idx-1, 1]).round(4)
    # Find the fractional movement from the previous row
    moved = terp / span
    # Finally calculate the corresponding move for Coupon
    df.iloc[idx, 0] = df.iloc[idx-1, 0] + (abs(df.iloc[idx-1, 0] - df.iloc[idx+1, 0]) * moved)
    return df
Now use the function to calculate the Coupon of a new bond from its Price in the DataFrame.
# Add 107
df = add_bond(107, df)
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.875 107.79
6 8.750 107.54
7 8.625 107.29
8 8.500 107.04
9 8.480 107.00
10 8.375 106.79
11 8.250 106.54
Add one more.
# Add 107.9
df = add_bond(107.9, df)
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.930 107.90
6 8.875 107.79
7 8.750 107.54
8 8.625 107.29
9 8.500 107.04
10 8.480 107.00
11 8.375 106.79
12 8.250 106.54
If this answer meets your needs, please remember to select the correct answer. Thanks.
Probably there's a function somewhere that does the work for you, but my advice is to program it yourself; it's not difficult at all and it's a nice programming exercise. Just find the slope in that segment and use the equation of a straight line:
(y-y0) = ((y1-y0)/(x1-x0))*(x-x0) -> y = ((y1-y0)/(x1-x0))*(x-x0) + y0
Where:
x -> Your given value (107)
x1 & x0 -> The values right above and below (107.04 & 106.79)
y1 & y0 -> The corresponding values to x1 & x0 (8.5 & 8.375)
y -> Your target value.
Just basic high-school maths ;-)
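A quick sketch of that formula in Python, plugging in the example numbers from the question:

def interpolate(x, x0, x1, y0, y1):
    # line through (x0, y0) and (x1, y1), evaluated at x
    return (y1 - y0) / (x1 - x0) * (x - x0) + y0

coupon = interpolate(107.0, x0=106.79, x1=107.04, y0=8.375, y1=8.5)
print(round(coupon, 3))  # 8.48, matching the interpolated row in the answer above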
Long story short, I have a csv file which I read as a pandas dataframe. The file contains a weather report, but all of the measurements for temperature are in Fahrenheit. I've figured out how to convert them:
import pandas as pd
df = pd.read_csv('report.csv')
df['average temperature'] = (df['average temperature'] - 32) * 5/9
But then the data in this column has up to 6 decimal places.
I've found code that will round all the data in the dataframe, but I need it for only this column.
df.round(2)
I don't like how it has to be a separate piece of code on a separate line, and how it modifies all of my data. Is there a more elegant way to go about this problem? Is there a way to apply this to other columns in my dataframe, such as maximum temperature and minimum temperature, without having to copy the above piece of code?
To round only some columns, use a subset:
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = df[cols].round(2)
If you want to convert only some columns from a list:
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = ((df[cols] - 32) * 5/9).round(2)
If you want to round each column separately:
df['average temperature'] = df['average temperature'].round(2)
df['maximum temperature'] = df['maximum temperature'].round(2)
df['minimum temperature'] = df['minimum temperature'].round(2)
Sample:
import numpy as np

df = (pd.DataFrame(np.random.randint(30, 100, (10, 3)),
                   columns=['maximum temperature','minimum temperature','average temperature'])
        .assign(a='m', b=range(10)))
print (df)
maximum temperature minimum temperature average temperature a b
0 97 60 98 m 0
1 64 86 64 m 1
2 32 64 95 m 2
3 60 56 93 m 3
4 43 89 64 m 4
5 40 62 86 m 5
6 37 40 70 m 6
7 61 33 46 m 7
8 36 44 46 m 8
9 63 30 33 m 9
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = ((df[cols] - 32) * 5/9).round(2)
print (df)
maximum temperature minimum temperature average temperature a b
0 36.11 15.56 36.67 m 0
1 17.78 30.00 17.78 m 1
2 0.00 17.78 35.00 m 2
3 15.56 13.33 33.89 m 3
4 6.11 31.67 17.78 m 4
5 4.44 16.67 30.00 m 5
6 2.78 4.44 21.11 m 6
7 16.11 0.56 7.78 m 7
8 2.22 6.67 7.78 m 8
9 17.22 -1.11 0.56 m 9
Here's a single-line solution with apply and a conversion function.

def convert_to_celsius(f):
    return 5.0/9.0 * (f - 32)

df[['Column A','Column B']] = df[['Column A','Column B']].apply(convert_to_celsius).round(2)
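A quick usage sketch on a made-up frame (the column names here are placeholders from this answer, not from the asker's report.csv):

import pandas as pd

df = pd.DataFrame({'Column A': [32.0, 212.0], 'Column B': [98.6, 68.0]})
df[['Column A', 'Column B']] = df[['Column A', 'Column B']].apply(convert_to_celsius).round(2)
print(df)  # Column A -> 0.0, 100.0; Column B -> 37.0, 20.0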