I'm trying to calculate the tracking error of a fund against a number of different benchmarks (tracking error being defined as the standard deviation of the percent difference between the fund and the benchmark). The time series for the fund and all of the benchmarks are in a single DataFrame that I read from an Excel file. What I have so far is below (the idea being that arg1 represents each benchmark and the function is then applied with applymap), but it raises a KeyError. Any suggestions?
import pandas as pd
import numpy as np
data = pd.read_excel('File_Path.xlsx')
def index_analytics(arg1):
    tracking_err = np.std((data['Fund'] - data[arg1]) / data[arg1])
    return tracking_err
data.applymap(index_analytics)
There are a few things that need to be fixed. First, applymap passes each individual value of every column to your function (index_analytics), so arg1 is a single scalar value from your DataFrame. data[arg1] will therefore always raise a KeyError unless every value in the DataFrame also happens to be a column name.
You also shouldn't need applymap (or apply) for this at all. Assuming your benchmarks are in the same DataFrame, you should be able to do something like the following for each benchmark. (Next time, please include a sample of your DataFrame.)
df['Benchmark1_result'] = (df['Fund'] - df['Benchmark1']) / df['Benchmark1']
And if you want to calculate the standard deviations for all the benchmarks at once, you can do this:
# assume you have a list with all the benchmark column names
benchmark_columns = [list, of, benchmark, columns]
np.std((df['Fund'].values[:, None] - df[benchmark_columns].values) / df[benchmark_columns].values, axis=0)
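If you'd rather stay in pandas and get one labelled tracking error per benchmark, here is a minimal sketch (the toy data and column names are made up just so it runs; benchmark_columns plays the same role as above):
import pandas as pd

# toy data standing in for the frame read from Excel
df = pd.DataFrame({'Fund': [0.010, 0.020, 0.015],
                   'Benchmark1': [0.012, 0.018, 0.016],
                   'Benchmark2': [0.008, 0.022, 0.013]})
benchmark_columns = ['Benchmark1', 'Benchmark2']

# percent difference of the fund versus each benchmark, column by column
pct_diff = df[benchmark_columns].rsub(df['Fund'], axis=0).div(df[benchmark_columns])
# one tracking error per benchmark (note pandas' std uses ddof=1, while np.std defaults to ddof=0)
tracking_errors = pct_diff.std()
print(tracking_errors)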
Assuming you're following this definition of Tracking Error (the square root of the sum of squared active returns):
import pandas as pd
import numpy as np
# Example DataFrame
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
df['Active_Return'] = df['Portfolio_Returns'] - df['Bench_Returns']
print(df.head())
list_ = df['Active_Return']
temp_ = []
for val in list_:
    x = val**2
    temp_.append(x)
tracking_error = np.sqrt(sum(temp_))
print(f"Tracking Error is: {tracking_error}")
Or if you want it more compact (because apparently the cool kids do it):
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
tracking_error = np.sqrt(sum([val**2 for val in df['Portfolio_Returns'] - df['Bench_Returns']]))
print(f"Tracking Error is: {tracking_error}")
I've been trying to create a generic weather importer that can resample data to set intervals (e.g. from 20 min to hourly or the like; I've used 60 min in the code below).
For this I wanted to use the Pandas resample function. After a bit of puzzling I came up with the below (which is not the prettiest code). I had one problem with the averaging of the wind direction for the set periods, which I've tried to solve with pandas' resampler.apply.
However, I've hit a problem with the definition which gives the following error:
TypeError: can't convert complex to float
I realise I'm trying to force a square peg in a round hole, but I have no idea how to overcome this. Any hints would be appreciated.
raw data
import pandas as pd
import os
from datetime import datetime
from pandas import ExcelWriter
from math import *
os.chdir('C:\\test')
file = 'bom.csv'
df = pd.read_csv(file,skiprows=0, low_memory=False)
#custom dataframe reampler (.resampler.apply)
def custom_resampler(thetalist):
    try:
        s = 0
        c = 0
        n = 0.0
        for theta in thetalist:
            s = s + sin(radians(theta))
            c = c + cos(radians(theta))
            n += 1
        s = s/n
        c = c/n
        eps = (1 - (s**2 + c**2))**0.5
        sigma = asin(eps)*(1 + (2.0/3.0**0.5 - 1)*eps**3)
    except ZeroDivisionError:
        sigma = 0
    return degrees(sigma)
# create time index and format dataframes
df['DateTime'] = pd.to_datetime(df['DateTime'],format='%d/%m/%Y %H:%M')
df.index = df['DateTime']
df = df.drop(['Year','Month', 'Date', 'Hour', 'Minutes','DateTime'], axis=1)
dfws = df
dfwdd = df
dfws = dfws.drop(['WDD'], axis=1)
dfwdd = dfwdd.drop(['WS'], axis=1)
#resample data to xxmin and merge data
dfwdd = dfwdd.resample('60T').apply(custom_resampler)
dfws = dfws.resample('60T').mean()
dfoutput = pd.merge(dfws, dfwdd, right_index=True, left_index=True)
# write series to Excel
writer = pd.ExcelWriter('bom_out.xlsx', engine='openpyxl')
dfoutput.to_excel(writer, sheet_name='bom_out')
writer.save()
I did a bit more research and found that changing the definition worked best.
However, this gave a weird outcome when averaging opposing angles (180 degrees apart), which I discovered by accident. I had to subtract a small value, which introduces a small error (in degrees) in the actual outcome.
I would still be interested to know:
what went wrong with the complex math
a better solution for opposing angles (180 degrees)
# changed the imports
from math import sin,cos,atan2,pi
import numpy as np
#changed the definition
def custom_resampler(angles, weights=0, setting='degrees'):
    '''computes the mean angle'''
    if weights == 0:
        weights = np.ones(len(angles))
    sumsin = 0
    sumcos = 0
    if setting == 'degrees':
        angles = np.array(angles)*pi/180
    for i in range(len(angles)):
        sumsin += weights[i]/sum(weights)*sin(angles[i])
        sumcos += weights[i]/sum(weights)*cos(angles[i])
    average = atan2(sumsin, sumcos)
    if setting == 'degrees':
        average = average*180/pi
    if average == 180 or average == -180:  # added since averaging 290 degrees and 110 degrees gave a weird outcome
        average -= 0.1
    elif average < 0:
        average += 360
    return round(average, 1)
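For what it's worth, here is a more compact, vectorized sketch of the same atan2-based circular mean (equal weights only), which can be passed to resampler.apply in the same way as custom_resampler. Note that it does not remove the inherent ambiguity of averaging two exactly opposing directions either:
import numpy as np

def circular_mean_degrees(angles):
    """Equal-weight circular mean of a sequence of angles given in degrees."""
    rad = np.deg2rad(np.asarray(angles, dtype=float))
    if rad.size == 0:
        return np.nan  # empty resample bin
    mean = np.degrees(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean()))
    return round(mean % 360, 1)  # map the result into [0, 360)

# e.g. dfwdd = dfwdd.resample('60T').apply(circular_mean_degrees)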
I'm trying to create a function to calculate Heikin Ashi candles (for financial analysis).
My indicators.py file looks like
from pandas import DataFrame, Series
def heikinashi(dataframe):
    open = dataframe['open']
    high = dataframe['high']
    low = dataframe['low']
    close = dataframe['close']
    ha_close = 0.25 * (open + high + low + close)
    ha_open = 0.5 * (open.shift(1) + close.shift(1))
    ha_low = max(high, ha_open, ha_close)
    ha_high = min(low, ha_open, ha_close)
    return dataframe, ha_close, ha_open, ha_low, ha_high
And in my main script I'm trying to call this function in the most effective way to get back those four series: ha_close, ha_open, ha_low and ha_high.
My main script looks something like:
import indicators as ata
ha_close, ha_open, ha_low, ha_high = ata.heikinashi(dataframe)
dataframe['ha_close'] = ha_close(dataframe)
dataframe['ha_open'] = ha_open(dataframe)
dataframe['ha_low'] = ha_low(dataframe)
dataframe['ha_high'] = ha_high(dataframe)
but for some strange reason I cannot find the dataframes
What would be the most efficient way to do this, code-wise and with minimal calls?
I expect dataframe['ha_close'] etc. to end up with the correct data, as computed in the function.
Any advice appreciated
Thank you!
In heikinashi() you are returning five elements (dataframe plus the four ha_* series); however, in your main script you are only assigning the return values to four variables.
Try changing the main script to:
dataframe, ha_close, ha_open, ha_low, ha_high = ata.heikinashi(dataframe)
Apart from that, it looks like the dataframe you are passing into heikinashi() is not defined anywhere.
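As a rough sketch of how both sides could fit together, keeping the ha_close/ha_open formulas from the question: the built-in max/min don't work elementwise on Series, so a concat-based max/min is used, following the commonly quoted convention that ha_high is the elementwise max and ha_low the elementwise min (the question has these the other way around). The toy OHLC data is made up just so the example runs:
import pandas as pd

# made-up OHLC data so the sketch is runnable
dataframe = pd.DataFrame({'open':  [10.0, 11.0, 12.0],
                          'high':  [11.5, 12.5, 13.0],
                          'low':   [ 9.5, 10.5, 11.5],
                          'close': [11.0, 12.0, 12.5]})

def heikinashi(dataframe):
    ha_close = 0.25 * (dataframe['open'] + dataframe['high'] + dataframe['low'] + dataframe['close'])
    ha_open = 0.5 * (dataframe['open'].shift(1) + dataframe['close'].shift(1))
    ha_high = pd.concat([dataframe['high'], ha_open, ha_close], axis=1).max(axis=1)
    ha_low = pd.concat([dataframe['low'], ha_open, ha_close], axis=1).min(axis=1)
    return dataframe, ha_close, ha_open, ha_low, ha_high

dataframe, ha_close, ha_open, ha_low, ha_high = heikinashi(dataframe)
dataframe['ha_close'] = ha_close  # these are Series, so assign them directly rather than calling them
dataframe['ha_open'] = ha_open
dataframe['ha_low'] = ha_low
dataframe['ha_high'] = ha_high
print(dataframe)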
No matter what I do, I can't seem to add up all the base volumes and quote volumes easily! I want to end up with a total base volume and a total quote volume for all the data in the DataFrame. Can someone help me with how to do this easily?
I have tried saving the data in a dictionary first and then summing it, but I just can't seem to make this work!
import urllib.request
import pandas as pd
import json
def call_data():  # Call data from Poloniex
    global df
    datalink = 'https://poloniex.com/public?command=returnTicker'
    df = urllib.request.urlopen(datalink)
    df = df.read().decode('utf-8')
    df = json.loads(df)
    global current_eth_price
    for k, v in df.items():
        if 'ETH' in k:
            if 'USDT_ETH' in k:
                current_eth_price = round(float(v['last']), 2)
                print("Current ETH Price $:", current_eth_price)
def calc_volumes():  # Calculate the base & quote volumes
    global volume_totals
    for k, v in df.items():
        if 'ETH' in k:
            basevolume = float(v['baseVolume'])*current_eth_price
            quotevolume = float(v['quoteVolume'])*float(v['last'])*current_eth_price
            if quotevolume > 0:
                percentages = (quotevolume - basevolume) / basevolume * 100
                volume_totals = {'key': [k],
                                 'basevolume': [basevolume],
                                 'quotevolume': [quotevolume],
                                 'percentages': [percentages]}
                print("volume totals:", volume_totals)
                print("#"*8)
call_data()
calc_volumes()
A few notes:
For the next 2 years, don't use the global keyword for anything.
Put the function documentation (docstring) inside the function, in quotes, rather than in a trailing comment.
Using the requests library would be much easier than urllib. However ...
pandas can fetch the JSON and parse it all in one step.
OK, it doesn't have to be as split up as this; I'm just showing you how to properly pass variables around instead of using globals.
I could not find "ETH" by itself. In the data they sent there are these three: ['BTC_ETH', 'USDT_ETH', 'USDC_ETH'], so I used "USDT_ETH"; I hope the substitution is OK.
calc_volumes seems to be doing the calculation and also acting as some sort of filter (it's picky about what it prints). This function needs to be broken up into its two separate jobs: printing and calculating. (Maybe there was a filter step, but I leave that for homework.)
import pandas as pd
eth_price_url = 'https://poloniex.com/public?command=returnTicker'
def get_data(url=''):
    """ Call data from Poloniex and put it in a dataframe """
    data = pd.read_json(url)
    return data

def get_current_eth_price(data=None):
    """ grab the price out of the dataframe """
    current_eth_price = data['USDT_ETH']['last'].round(2)
    return current_eth_price

def calc_volumes(data=None, current_eth_price=None):
    """ Calculate the base & quote volumes """
    # use the passed-in dataframe (not the global df) so nothing relies on globals
    data = data[data.columns[data.columns.str.contains('ETH')]].loc[['baseVolume', 'quoteVolume', 'last']]
    data = data.transpose()
    data[['baseVolume', 'quoteVolume']] *= current_eth_price
    data['quoteVolume'] *= data['last']
    # same percentage formula as in the question
    data['percentages'] = (data['quoteVolume'] - data['baseVolume']) / data['baseVolume'] * 100
    return data
df = get_data(url = eth_price_url)
the_price = get_current_eth_price(data = df)
print(f'the current eth price is: {the_price}')
volumes = calc_volumes(data=df, current_eth_price=the_price)
print(volumes)
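And since the original goal was a single total base volume and total quote volume, one way to get those from the returned frame is simply to sum the two columns:
totals = volumes[['baseVolume', 'quoteVolume']].sum()
print("total base volume:", totals['baseVolume'])
print("total quote volume:", totals['quoteVolume'])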
This code seems kind of odd and inconsistent... for example, you're importing pandas and calling your variable df but you're not actually using dataframes. If you used df = pd.read_json('https://poloniex.com/public?command=returnTicker', 'index')* to get a dataframe, most of your data manipulation here would become much easier, and wouldn't require any loops either.
For example, the first function's code would become as simple as current_eth_price = df.loc['USDT_ETH','last'].
The second function's code would basically be
eth_rows = df[df.index.str.contains('ETH')]
total_base_volume = (eth_rows.baseVolume * current_eth_price).sum()
total_quote_volume = (eth_rows.quoteVolume * eth_rows['last'] * current_eth_price).sum()
(*The 'index' argument tells pandas to read the JSON dictionary indexed by rows, then columns, rather than columns, then rows.)
I'm working with a csv file in Python using Pandas.
I'm having some trouble working out how to achieve the following goal.
What I need to achieve is to group entries using a similarity function.
For example, each group X should contain all entries where every pair in the group differs by at most Y on a certain attribute (column) value.
Given this example of CSV:
name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;29
jenny;female;boston2;30
mattia;na;BostonDynamics;50
and considering the age column with a maximum difference of 3 on its value, I would get the following groups:
A = {john;male;newyork;20
jack;male;newyork;21}
B={eric;male;san francisco;29
jenny;female;boston2;30}
C={mary;female;losangeles;45
maryanne;female;losangeles;48}
D={maryanne;female;losangeles;48
mattia;na;BostonDynamics;50}
This is actually my work-around, but I hope there is something more pythonic.
import pandas as pandas
import numpy as numpy
def main():
    csv_path = "../resources/dataset_string.csv"
    csv_data_frame = pandas.read_csv(csv_path, delimiter=";")
    print("\nOriginal Values:")
    print(csv_data_frame)
    sorted_df = csv_data_frame.sort_values(by=["age", "name"], kind="mergesort")
    print("\nSorted Values by AGE & NAME:")
    print(sorted_df)
    min_age = int(numpy.min(sorted_df["age"]))
    print("\nMin_Age:", min_age)
    max_age = int(numpy.max(sorted_df["age"]))
    print("\nMax_Age:", max_age)
    threshold = 3
    bins = numpy.arange(min_age, max_age, threshold)
    print("Bins:", bins)
    ind = numpy.digitize(sorted_df["age"], bins)
    print(ind)
    print("\n\nClustering by hand:\n")
    current_min = min_age
    for cluster in range(min_age, max_age, threshold):
        next_min = current_min + threshold
        print("<Cluster({})>".format(cluster))
        print(sorted_df[(current_min <= sorted_df["age"]) & (sorted_df["age"] <= next_min)])
        print("</Cluster({})>\n".format(cluster + threshold))
        current_min = next_min

if __name__ == "__main__":
    main()
On one attribute this is simple:
Sort
Linearly scan the data, and whenever the threshold is violated, begin a new group.
While this won't be optimal, it should be better than what you already have, at less cost.
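A minimal sketch of that sort-and-scan idea on the age column (greedy: a new group starts whenever the current value is more than the threshold away from the first value of the current group, so all pairs within a group differ by at most the threshold; it won't always reproduce the exact groups in the example above):
import pandas as pd

def group_by_threshold(frame, column, threshold):
    frame = frame.sort_values(by=column).reset_index(drop=True)
    group_ids = []
    group_start = None
    current_group = -1
    for value in frame[column]:
        if group_start is None or value - group_start > threshold:
            current_group += 1  # start a new group
            group_start = value
        group_ids.append(current_group)
    return frame.assign(group=group_ids)

# e.g. for gid, rows in group_by_threshold(csv_data_frame, "age", 3).groupby("group"):
#          print(gid, rows.to_dict("records"))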
However, in the multivariate case, finding the optimal groups is supposedly NP-hard, so finding the optimal grouping will require a brute-force search in exponential time. You will therefore need to approximate this, either with AGNES (in O(n³)) or with CLINK (usually worse quality, but O(n²)).
As this is fairly expensive, it will not be a simple operation on your data frame.
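If you do go the hierarchical route, a rough sketch using SciPy's complete-linkage clustering (the CLINK-style criterion) on the age column could look like the following; cutting the tree at t=3 with criterion='distance' gives flat clusters whose diameter is at most 3 (the exact groups may differ from the hand-made example above):
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

df = pd.DataFrame({'name': ['john', 'jack', 'mary', 'maryanne', 'eric', 'jenny', 'mattia'],
                   'age':  [20, 21, 45, 48, 29, 30, 50]})

# complete linkage on the single numeric attribute, then cut at the threshold
Z = linkage(df[['age']].to_numpy(dtype=float), method='complete')
df['group'] = fcluster(Z, t=3, criterion='distance')
print(df.sort_values('group'))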
I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.
How can I do a pivot on data this large with a limited amount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os
pd.set_option('io.hdf.default_format','table')
# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid'] = [1, 2, 3, 1, 2, 3, 1, 2, 3]  # evenly distributing shipmentid values for testing purposes
    frame['qty'] = np.random.randint(1, 5, 9)  # random quantity is ok for this test
    frame['catid'] = np.random.randint(1, 5, 9)  # random category is ok for this test
    return frame

def pivotSegment(segmentNumber, passedFrame):
    segmentSize = 3  # take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)]  # slice the input DF
    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1, 5+1)
    span['shipmentid'] = 1
    span['qty'] = 0
    frame = frame.append(span)
    return frame.pivot_table(['qty'], index=['shipmentid'], columns='catid',
                             aggfunc='sum', fill_value=0).reset_index()
def createStore():
    store = pd.HDFStore('testdata.h5')
    return store
segMin = 0
segMax = 4
store = createStore()
frame = loadFrame()
print('Printing Frame')
print(frame)
print(frame.info())
for i in range(segMin, segMax):
    segment = pivotSegment(i, frame)
    store.append('data', frame[(i*3):(i*3 + 3)])
    store.append('pivotedData', segment)
print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])
print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData', chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid', level=0).sum())
print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])
store.close()
os.remove('testdata.h5')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
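For example, assuming the key column was declared as a data column when appending (store.append('df', chunk, data_columns=['someKey'])), a query for a single group might look like:
subset = store.select('df', where='someKey == 1')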
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))
In Python 3 you must import reduce from functools.
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks) # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor / if there are a large number of new groups then it may be preferable to start the res as zero of the correct size (by getting the unique group keys e.g. by looping through the chunks), and then add in place.
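A rough sketch of that last idea, with a hypothetical store and key column just to make it concrete: one cheap pass over a single column collects the group keys, then a zero-filled result frame is pre-allocated and each chunk's partial sums are added into it.
import os
import numpy as np
import pandas as pd

# hypothetical table-format store standing in for the real one
store = pd.HDFStore('sketch.h5', mode='w')
store.append('df', pd.DataFrame({'someKey': [1, 2, 3, 1, 2, 3],
                                 'qty': np.arange(6)}),
             data_columns=['someKey'])

# first pass: the full set of group keys, read from a single column of the store
keys = store.select_column('df', 'someKey').unique()

# pre-allocate the result with a row of zeros per key, then accumulate chunk by chunk
res = None
for chunk in store.select('df', chunksize=2):
    partial = chunk.groupby('someKey').sum()
    if res is None:
        res = pd.DataFrame(0, index=pd.Index(keys, name='someKey'), columns=partial.columns)
    res = res.add(partial, fill_value=0)  # no new rows appear; every key already exists in res

print(res)
store.close()
os.remove('sketch.h5')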