pandas find two rolling max highs and calculate slope - python

I'm looking for a way to find the two max highs in a rolling frame and calculate the slope to extrapolate a possible third high.
I have several problems with this :)
a) how to find a second high?
b) how to know the position of the two highs (for a simple slope : slope = (MaxHigh2-MaxHigh1)/(PosMaxHigh2-PosMaxHigh1))?
I could, of course, do something like this, but it only works if high1 > high2 :)
and the two highs would not come from the same range.
import quandl
import pandas as pd
import numpy as np
import sys
df = quandl.get("WIKI/GOOGL")
df = df[['High', 'Close']].iloc[:10]
df['MAX_HIGH_3P'] = df['High'].rolling(window=3,center=False).max()
df['MAX_HIGH_5P'] = df['High'].rolling(window=5,center=False).max()
df['SLOPE'] = (df['MAX_HIGH_5P']-df['MAX_HIGH_3P'])/(5-3)
print(df.head(20).to_string())

Sorry for the slightly messy solution, but I hope it helps.
First I define a function that takes a NumPy array as input, checks that at least two elements are not null, and then calculates the slope (according to your formula, I think). It looks like this:
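def calc_slope(input_list):
    # Work on a plain float array so positional indexing is unambiguous
    values = np.asarray(input_list, dtype=float)
    # At least two non-NaN values are needed to define a slope
    if np.count_nonzero(~np.isnan(values)) < 2:
        return np.nan
    max_index = np.nanargmax(values)
    max_value = values[max_index]
    # Mask the maximum so the next search finds the second high,
    # even when the same value occurs twice in the window
    masked = values.copy()
    masked[max_index] = np.nan
    second_max_index = np.nanargmax(masked)
    second_max = masked[second_max_index]
    return (max_value - second_max) / float(max_index - second_max_index)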
Then you just have to apply a rolling window to whichever column you prefer, for example "High":
df['High'].rolling(window=5, min_periods=2, center=False).apply(calc_slope, raw=True)
You can also store the result in another column if you like:
df['High_slope'] = df['High'].rolling(window=5, min_periods=2, center=False).apply(calc_slope, raw=True)
Is that what you wanted?
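If the two highs should also be extrapolated to a possible third high, a minimal sketch along the following lines may help. It assumes the line is drawn through the two largest values of each window in chronological order. The helper names (two_highs, slope_of_two_highs, extrapolate_high) and the sample values are hypothetical and not part of the answer above.

```python
import numpy as np
import pandas as pd


def two_highs(window):
    """Return (pos1, high1, pos2, high2) for the two largest values, with pos1 < pos2."""
    values = np.asarray(window, dtype=float)
    if np.count_nonzero(~np.isnan(values)) < 2:
        return None
    # Treat NaN as -inf so it can never be picked as a high
    ranked = np.argsort(np.where(np.isnan(values), -np.inf, values), kind="stable")
    first, second = sorted(int(i) for i in ranked[-2:])
    return first, values[first], second, values[second]


def slope_of_two_highs(window):
    found = two_highs(window)
    if found is None:
        return np.nan
    pos1, high1, pos2, high2 = found
    return (high2 - high1) / (pos2 - pos1)


def extrapolate_high(window, steps_ahead=1):
    """Project the line through the two highs steps_ahead bars past the window end."""
    found = two_highs(window)
    if found is None:
        return np.nan
    pos1, high1, pos2, high2 = found
    slope = (high2 - high1) / (pos2 - pos1)
    target = len(window) - 1 + steps_ahead
    return high2 + slope * (target - pos2)


highs = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 14.0])  # made-up sample data
slopes = highs.rolling(window=5, min_periods=2).apply(slope_of_two_highs, raw=True)
projected = highs.rolling(window=5, min_periods=2).apply(extrapolate_high, raw=True)
```

Because both highs come from the same window and are ordered by position, the sign of the slope does not depend on which of the two is larger.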

Related

Resampling of Weather Data for variable timeperiods by using Pandas Dataframe

I've been trying to create a generic weather importer that can resample data to set intervals (e.g. from 20 minutes to hours; I've used 60 minutes in the code below).
For this I wanted to use the pandas resample function. After a bit of puzzling I came up with the code below (which is not the prettiest). I had one problem with averaging the wind direction over the set periods, which I tried to solve with pandas' resampler.apply.
However, I've hit a problem with the function definition, which gives the following error:
TypeError: can't convert complex to float
I realise I'm trying to force a square peg into a round hole, but I have no idea how to overcome this. Any hints would be appreciated.
raw data
import pandas as pd
import os
from datetime import datetime
from pandas import ExcelWriter
from math import *
os.chdir('C:\\test')
file = 'bom.csv'
df = pd.read_csv(file,skiprows=0, low_memory=False)
# custom dataframe resampler (.resampler.apply)
def custom_resampler(thetalist):
    try:
        s=0
        c=0
        n=0.0
        for theta in thetalist:
            s=s+sin(radians(theta))
            c=c+cos(radians(theta))
            n+=1
        s=s/n
        c=c/n
        eps=(1-(s**2+c**2))**0.5
        sigma=asin(eps)*(1+(2.0/3.0**0.5-1)*eps**3)
    except ZeroDivisionError:
        sigma=0
    return degrees(sigma)
# create time index and format dataframes
df['DateTime'] = pd.to_datetime(df['DateTime'],format='%d/%m/%Y %H:%M')
df.index = df['DateTime']
df = df.drop(['Year','Month', 'Date', 'Hour', 'Minutes','DateTime'], axis=1)
dfws = df
dfwdd = df
dfws = dfws.drop(['WDD'], axis=1)
dfwdd = dfwdd.drop(['WS'], axis=1)
#resample data to xxmin and merge data
dfwdd = dfwdd.resample('60min').apply(custom_resampler)
dfws = dfws.resample('60min').mean()
dfoutput = pd.merge(dfws, dfwdd, right_index=True, left_index=True)
# write series to Excel
with pd.ExcelWriter('bom_out.xlsx', engine='openpyxl') as writer:
    dfoutput.to_excel(writer, sheet_name='bom_out')
I did a bit more research and found that changing the function definition worked best.
However, this gave a weird outcome when averaging opposing angles (180 degrees apart), which I discovered by accident. I had to subtract a small value, which introduces a small error in the result.
I would still be interested to know:
what was wrong with the complex math
a better solution for opposing angles (180 degrees apart)
# changed the imports
from math import sin,cos,atan2,pi
import numpy as np
# changed the definition
def custom_resampler(angles,weights=0,setting='degrees'):
    '''computes the mean angle'''
    if weights==0:
        weights=np.ones(len(angles))
    sumsin=0
    sumcos=0
    if setting=='degrees':
        angles=np.array(angles)*pi/180
    for i in range(len(angles)):
        sumsin+=weights[i]/sum(weights)*sin(angles[i])
        sumcos+=weights[i]/sum(weights)*cos(angles[i])
    average=atan2(sumsin,sumcos)
    if setting=='degrees':
        average=average*180/pi
    if average == 180 or average == -180: # added since 290 degrees and 110 degrees gave a weird average
        average -= 0.1
    elif average < 0:
        average += 360
    return round(average,1)
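On the two open points: in Python 3, raising a negative float to the power 0.5 returns a complex number. Floating-point rounding can push s**2 + c**2 slightly above 1 (for example when the directions in a period are nearly identical), so eps becomes complex and asin raises the TypeError. Note also that the first function appears to follow the Yamartino estimate of the standard deviation of the direction, not its mean. The sketch below clamps the value before taking the root and returns NaN for the mean when the directions cancel out (such as 290 and 110 degrees), since no meaningful average exists in that case. The function names and the tolerance are assumptions, not part of the original code.

```python
import numpy as np


def circular_mean_deg(angles_deg, tol=1e-9):
    """Mean direction in degrees in [0, 360]; NaN if the directions cancel out."""
    a = np.radians(np.asarray(angles_deg, dtype=float))
    a = a[~np.isnan(a)]
    if a.size == 0:
        return np.nan
    s = np.sin(a).mean()
    c = np.cos(a).mean()
    # A resultant vector of (almost) zero length has no defined direction
    if np.hypot(s, c) < tol:
        return np.nan
    return float(np.degrees(np.arctan2(s, c)) % 360.0)


def yamartino_sigma_deg(angles_deg):
    """Yamartino-style estimate of the direction's standard deviation, in degrees."""
    a = np.radians(np.asarray(angles_deg, dtype=float))
    a = a[~np.isnan(a)]
    if a.size == 0:
        return np.nan
    s = np.sin(a).mean()
    c = np.cos(a).mean()
    # Clamp at zero: rounding can make 1 - (s^2 + c^2) slightly negative
    eps = np.sqrt(max(0.0, 1.0 - (s * s + c * c)))
    sigma = np.arcsin(eps) * (1.0 + (2.0 / np.sqrt(3.0) - 1.0) * eps ** 3)
    return float(np.degrees(sigma))
```

Assuming the wind direction column is named WDD as in the question, this could be applied with df['WDD'].resample('60min').apply(circular_mean_deg).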

Creating new pandas columns with original value plus random number in error range

I have a pandas dataframe which has a column 'INTENSITY' and a numpy array of same length containing the error for each intensity. I would like to generate columns with randomly generated numbers in the error range.
So far I use two nested for loops to create the new columns but I feel like this is inefficient:
theor_err = [ sqrt(abs(x)) for x in theor_df[str(INTENSITY)] ]
theor_err = np.asarray(theor_err)
for nr_sample in range(2):
    sample = np.zeros(len(theor_df[str(INTENSITY)]))
    for i, error in enumerate(theor_err):
        sample[i] = theor_df[str(INTENSITY)][i] + random.uniform(-error, error)
    theor_df['gen_{}'.format(nr_sample)] = Series(sample, index=theor_df.index)
theor_df.head()
Is there a more efficient way of approaching a problem like this?
Numpy can handle arrays for you. So, you can do it like this:
import pandas as pd
import numpy as np
a=pd.DataFrame([10,20,15,30],columns=['INTENSITY'])
a['theor_err']=np.sqrt(np.abs(a.INTENSITY))
a['sample']=a['INTENSITY']+np.random.uniform(-a['theor_err'],a['theor_err'])
Suppose you want to generate 6 samples. You can try the code below. You can tune the number of samples by setting the value of k.
df = pd.DataFrame([[1],[2],[3],[4],[-5]], columns=["intensity"])
k = 6
sample_names = ["sample" + str(i+1) for i in range(k)]
df["err"] = np.sqrt(np.abs((df["intensity"])))
df[sample_names] = pd.DataFrame(
    df["err"].map(lambda x: np.random.uniform(-x, x, k)).values.tolist())
df.loc[:,sample_names] = df.loc[:,sample_names].add(df.intensity, axis=0)
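Both answers can be combined into a single broadcasted call. The following is a minimal sketch, assuming NumPy's Generator API is available; the column names gen_0 ... gen_5, the seed and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded only to make the sketch reproducible

df = pd.DataFrame({"INTENSITY": [10.0, 20.0, 15.0, 30.0, -5.0]})  # made-up values
k = 6  # number of generated columns

intensity = df["INTENSITY"].to_numpy()
err = np.sqrt(np.abs(intensity))

# Column vectors broadcast against size=(n_rows, k): one bound pair per row
noise = rng.uniform(-err[:, None], err[:, None], size=(len(df), k))

samples = pd.DataFrame(
    intensity[:, None] + noise,
    index=df.index,
    columns=[f"gen_{i}" for i in range(k)],
)
result = pd.concat([df, samples], axis=1)
```

This draws all rows and all k columns at once, so no Python-level loop or per-row lambda is needed.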

(Python) Pandas - GroupBy() using a similarity function

I'm working with a CSV file in Python using pandas.
I'm having some trouble working out how to achieve the following goal.
I need to group entries using a similarity function.
For example, each group X should contain all entries where every pair in the group differs by at most Y on the value of a certain attribute column.
Given this example of CSV:
name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;29
jenny;female;boston2;30
mattia;na;BostonDynamics;50
and considering the age column, with a difference of at most 3 on this value I would get the following groups:
A = {john;male;newyork;20
jack;male;newyork;21}
B={eric;male;san francisco;29
jenny;female;boston2;30}
C={mary;female;losangeles;45
maryanne;female;losangeles;48}
D={maryanne;female;losangeles;48
mattia;na;BostonDynamics;50}
This is my current workaround, but I hope something more Pythonic exists.
import pandas as pandas
import numpy as numpy
def main():
    csv_path = "../resources/dataset_string.csv"
    csv_data_frame = pandas.read_csv(csv_path, delimiter=";")
    print("\nOriginal Values:")
    print(csv_data_frame)
    sorted_df = csv_data_frame.sort_values(by=["age", "name"], kind="mergesort")
    print("\nSorted Values by AGE & NAME:")
    print(sorted_df)
    min_age = int(numpy.min(sorted_df["age"]))
    print("\nMin_Age:", min_age)
    max_age = int(numpy.max(sorted_df["age"]))
    print("\nMax_Age:", max_age)
    threshold = 3
    bins = numpy.arange(min_age, max_age, threshold)
    print("Bins:", bins)
    ind = numpy.digitize(sorted_df["age"], bins)
    print(ind)
    print("\n\nClustering by hand:\n")
    current_min = min_age
    for cluster in range(min_age, max_age, threshold):
        next_min = current_min + threshold
        print("<Cluster({})>".format(cluster))
        print(sorted_df[(current_min <= sorted_df["age"]) & (sorted_df["age"] <= next_min)])
        print("</Cluster({})>\n".format(cluster + threshold))
        current_min = next_min

if __name__ == "__main__":
    main()
On one attribute this is simple:
Sort
Linearly scan the data, and whenever the threshold is violated, begin a new group.
While this won't be optimal, it should be better than what you already have, at less cost.
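The sort-and-scan idea for a single attribute can be sketched as follows. This is only one possible reading: a new group starts whenever a value is more than the threshold above the first value of the current group, which produces disjoint groups (unlike the overlapping groups C and D in the question). The function name group_by_threshold and the group column are hypothetical.

```python
import pandas as pd


def group_by_threshold(df, column, threshold):
    """Sort by `column` and assign a group id so values within a group span <= threshold."""
    ordered = df.sort_values(column, kind="mergesort")
    group_ids = []
    current_group = -1
    group_start = None
    for value in ordered[column]:
        # Start a new group when the threshold relative to the group's first value is violated
        if group_start is None or value - group_start > threshold:
            current_group += 1
            group_start = value
        group_ids.append(current_group)
    return ordered.assign(group=group_ids)


people = pd.DataFrame(
    {
        "name": ["john", "jack", "mary", "maryanne", "eric", "jenny", "mattia"],
        "age": [20, 21, 45, 48, 29, 30, 50],
    }
)
grouped = group_by_threshold(people, "age", 3)
for group_id, members in grouped.groupby("group"):
    print(group_id, list(members["name"]))
```

Because every member is compared with the group's smallest value, any two members of a group differ by at most the threshold.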
However, in the multivariate case, finding the optimal groups is supposedly NP-hard, so an exact solution will require brute-force search in exponential time. You will therefore need to approximate it, either with AGNES (in O(n³)) or with CLINK (usually worse quality, but O(n²)).
As this is fairly expensive, it will not be a simple operator on your data frame.

Optimizing Python Code: Faster groupby and for loops

I want to make the for loop below faster in Python.
import math
import pandas as pd
import numpy as np
import scipy.stats
np.random.seed(1)
xl = pd.DataFrame({'Concat' : np.arange(101,999), 'ships_x' : np.random.randint(1001,3000,size=898)})
yl = pd.DataFrame({'PickDate' : np.random.randint(1,8,size=10000),'Concat' : np.random.randint(101,999,size=10000), 'ships_x' : np.random.randint(101,300,size=10000), 'ships_y' : np.random.randint(1001,3000,size=10000)})
tempno = [np.random.randint(1,100,size=5)]
k=1
p = pd.DataFrame(0,index=np.arange(len(xl)),columns=['temp','cv']).astype(object)
for ib in range(0,len(xl)):
    tempno1 = np.append(tempno,ib)
    temp = list(set(tempno1))
    temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')[['ships_x','ships_y']].sum().reset_index()
    temptab['contri'] = temptab['ships_x']/temptab['ships_y']
    cv = scipy.stats.variation(temptab['contri'])
    p.at[k-1,'cv'] = 1 if math.isnan(cv) else cv
    p.at[k-1,'temp'] = temp
    k = k+1
where:
xl, yl - the two data frames I am working on, with columns such as Concat, ships_x and ships_y.
tempno - an initial list of indices into the xl data frame, referring to a list of 'Concat' values.
So, in each iteration of the for loop we add one extra index to tempno and then subset the 'yl' data frame to the rows whose 'Concat' values match those selected from 'xl'. Then we compute the coefficient of variation (taken from the scipy library) and record it in the new data frame 'p'.
The problem is that this takes too much time, as the number of loop iterations runs into the thousands. The groupby line takes the most time. I have tried a few changes, and the code now looks like the version below, with the changes noted in comments. There is a slight improvement, but it is not enough for my purpose. Please suggest the fastest possible way to implement this. Many thanks.
# Getting all tempno1 into a list with one step
tempno1 = [np.append(tempno,ib) for ib in [xb for xb in range(0,len(xl))]]
temp = [list(set(tempk)) for tempk in tempno1]
# Taking only needed columns from x and y dfs
xtemp = xl[['Concat']]
ytemp = yl[['Concat','ships_x','ships_y','PickDate']]
#Shortlisting y df and groupby in two diff steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')[['ships_x','ships_y']].sum().reset_index() for ytempk in ytemp]
tempkcontri = [tempk['ships_x']/tempk['ships_y'] for tempk in temptab]
tempkcontri = [pd.DataFrame(tempkcontri[i],columns=['contri']) for i in range(0,len(tempkcontri))]
temptab = [temptab[i].join(tempkcontri[i]) for i in range(0,len(temptab))]
pcv = [1 if math.isnan(scipy.stats.variation(temptabkk['contri'])) else scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
p = pd.DataFrame({'temp' : temp,'cv': pcv})
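One possible way to avoid running a groupby in every iteration is to aggregate yl once per (Concat, PickDate) and then add the candidate row's sums to the sums of the fixed base set, since group sums are additive. The sketch below is written under several assumptions: the values in xl['Concat'] are unique, ships_y is strictly positive (so a total of zero means the PickDate is absent from the subset), and the coefficient of variation is the population standard deviation divided by the mean, which is the default behaviour of scipy.stats.variation. The function name fast_cv is hypothetical, and the 'temp' column is not rebuilt here.

```python
import numpy as np
import pandas as pd


def fast_cv(xl, yl, base_positions):
    """Coefficient of variation for each candidate row of xl, using a single groupby."""
    # Aggregate once: one row per Concat, one column per PickDate
    agg = yl.groupby(["Concat", "PickDate"])[["ships_x", "ships_y"]].sum()
    sum_x = agg["ships_x"].unstack(fill_value=0).reindex(index=xl["Concat"], fill_value=0)
    sum_y = agg["ships_y"].unstack(fill_value=0).reindex(index=xl["Concat"], fill_value=0)
    sum_x = sum_x.to_numpy(dtype=float)
    sum_y = sum_y.to_numpy(dtype=float)

    # Positions that are part of every subset
    base = np.unique(base_positions)
    in_base = np.zeros(len(xl), dtype=bool)
    in_base[base] = True
    # Only add the candidate row if it is not already in the base set
    add_row = (~in_base)[:, None]

    tot_x = sum_x[base].sum(axis=0) + sum_x * add_row
    tot_y = sum_y[base].sum(axis=0) + sum_y * add_row

    with np.errstate(divide="ignore", invalid="ignore"):
        # PickDates absent from a subset are excluded via NaN
        contri = np.where(tot_y > 0, tot_x / tot_y, np.nan)
        cv = np.nanstd(contri, axis=1) / np.nanmean(contri, axis=1)
    # Same fallback as the original loop: NaN becomes 1
    return np.where(np.isnan(cv), 1.0, cv)
```

With the data from the question, the call would look like cv = fast_cv(xl, yl, tempno), after which the result can be stored with p = pd.DataFrame({'cv': cv}).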

Tracking Error on a number of benchmarks

I'm trying to calculate the tracking error of a fund I'm looking at against a number of different benchmarks (tracking error is defined as the standard deviation of the percent difference between the fund and the benchmark). The time series for the fund and all the benchmarks are in a data frame that I read from an Excel file. What I have so far is below, with the idea that arg1 represents each benchmark and the function is applied using applymap, but it raises a KeyError. Any suggestions?
import pandas as pd
import numpy as np
data = pd.read_excel('File_Path.xlsx')
def index_analytics(arg1):
    tracking_err = np.std((data['Fund'] - data[arg1]) / data[arg1])
    return tracking_err
data.applymap(index_analytics)
There are a few things that need to be fixed. First, applymap passes each individual value from every column to your function (index_analytics), so arg1 is a single scalar value from your dataframe. data[arg1] will therefore always raise a KeyError unless all your values are also column names.
You also shouldn't need apply to do this. Assuming your benchmarks are in the same dataframe, you should be able to do something like this for each benchmark. Next time, include a sample of your dataframe.
data['Benchmark1_result'] = (data['Fund'] - data['Benchmark1']) / data['Benchmark1']
And if you want to calculate the standard deviations for all the benchmarks at once, you can do this:
# assume you have a list of all the benchmark column names
benchmark_columns = ['Benchmark1', 'Benchmark2']  # placeholder names, replace with your own
np.std((data[['Fund']].values - data[benchmark_columns].values) / data[benchmark_columns].values, axis=0)
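The same calculation can be written with pandas alone. The following is a minimal sketch using the definition from the question (standard deviation of the percent difference between fund and benchmark); the function name tracking_errors, the column names and the sample values are made up for illustration.

```python
import pandas as pd


def tracking_errors(data, fund_col, benchmark_cols):
    """Standard deviation of (fund - benchmark) / benchmark for each benchmark column."""
    # rsub computes fund - benchmark, aligned row by row
    rel_diff = data[benchmark_cols].rsub(data[fund_col], axis=0).div(data[benchmark_cols])
    # ddof=0 matches the default of np.std used in the question
    return rel_diff.std(ddof=0)


data = pd.DataFrame(
    {
        "Fund": [100.0, 102.0, 104.0],
        "Benchmark1": [100.0, 100.0, 100.0],
        "Benchmark2": [50.0, 51.0, 52.0],
    }
)
te = tracking_errors(data, "Fund", ["Benchmark1", "Benchmark2"])
print(te)
```

The result is a Series indexed by benchmark name, so no apply or applymap call is needed.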
Assuming you're following the definition of Tracking Error below:
import pandas as pd
import numpy as np
# Example DataFrame
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
df['Active_Return'] = df['Portfolio_Returns'] - df['Bench_Returns']
print(df.head())
list_ = df['Active_Return']
temp_ = []
for val in list_:
    x = val**2
    temp_.append(x)
tracking_error = np.sqrt(sum(temp_))
print(f"Tracking Error is: {tracking_error}")
Or if you want it more compact (because apparently the cool kids do it):
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
tracking_error = np.sqrt(sum([val**2 for val in df['Portfolio_Returns'] - df['Bench_Returns']]))
print(f"Tracking Error is: {tracking_error}")
