I have built up a simple DCF model, mainly in pandas. Basically all calculations happen in a single dataframe. I want to find a better coding style as the model grows more complex and more variables are added. The following example illustrates my current coding style - simple and straightforward.
# some customized formulas
def GrowthRate():
    ...

def BoundedVal():
    ...

# some operations
df['EBIT'] = df['revenue'] - df['costs']
df['NI'] = df['EBIT'] - df['tax'] - df['interests']
df['margin'] = df['NI'] / df['revenue']
I loop through all years to calculate values. Now I have added over 500 variables to the model and the calculations have also become more complex. I was thinking of creating a separate function for each variable and updating the main df accordingly. So the above code would become:
def EBIT(t):
    df['EBIT'][t] = df['revenue'][t] - df['costs'][t]
    # ....some more ops
    return df['EBIT'][t]

def NI(t):
    df['NI'][t] = EBIT(t) - df['tax'][t] - df['interests'][t]
    # ....some more ops
    return df['NI'][t]

def margin(t):
    if check_df_is_nan():
        df['margin'][t] = NI(t) / df['revenue'][t]
        # ....some more ops
        return df['margin'][t]
    else:
        return df['margin'][t]
Each function is able to 1) calculate results and update the df, and 2) return the value if called by other functions.
To avoid redundant calculation (think of margin(t) being called multiple times), it would be better to add a "check whether this value has already been calculated" step to each function.
My questions: 1) Is it possible to add that if statement to a whole group of functions, similar to the if clause above?
2) I have over 50 custom functions, so the main file has become too long. I cannot simply move them all to another file and import them, because some functions also refer to the dataframe in the main file. Any suggestions? Can I make the df a global variable so that functions from other files can modify and update it?
For 1, just check if the value is NaN or not.
import pandas as pd

def EBIT(t):
    if pd.notnull(df.loc[t, 'EBIT']):
        return df.loc[t, 'EBIT']
    df.loc[t, 'EBIT'] = df.loc[t, 'revenue'] - df.loc[t, 'costs']
    ...
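To apply the same check to a whole group of functions (your question 1), one option not spelled out above is a small memoizing decorator. This is a sketch, assuming the functions take df as a parameter (as suggested for question 2 below); cached_in is a name I made up:

import functools
import pandas as pd

def cached_in(col):
    """Return df[col][t] if already filled in; otherwise run the function."""
    def deco(func):
        @functools.wraps(func)
        def wrapper(df, t):
            if pd.notnull(df.loc[t, col]):
                return df.loc[t, col]
            return func(df, t)
        return wrapper
    return deco

@cached_in('EBIT')
def EBIT(df, t):
    df.loc[t, 'EBIT'] = df.loc[t, 'revenue'] - df.loc[t, 'costs']
    return df.loc[t, 'EBIT']

The check is then written once, and each of the 50+ functions only needs the one-line decorator naming its output column.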
For 2, using a global variable might work, but it's a bad approach; you should really avoid globals whenever possible.
Instead, make each function take the data frame as an argument. Then you can pass in whichever data frame you want to operate on.
# in some other file, e.g. operations.py
import pandas as pd

def EBIT(df, t):
    if pd.notnull(df.loc[t, 'EBIT']):
        return df.loc[t, 'EBIT']
    df.loc[t, 'EBIT'] = df.loc[t, 'revenue'] - df.loc[t, 'costs']
    return df.loc[t, 'EBIT']

# in the main file
import operations as op
# ...
op.EBIT(df, t)
P.S. Have you considered operating on the whole column at once rather than looping over t? It should be much faster.
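For illustration, a minimal sketch of the column-at-once style, assuming the column names from your question; the 'growth' column and the recurrence convention are made up for the example:

# rows that don't depend on earlier years can be computed for all years at once
df['EBIT'] = df['revenue'] - df['costs']
df['NI'] = df['EBIT'] - df['tax'] - df['interests']
df['margin'] = df['NI'] / df['revenue']

# even a hypothetical recurrence like revenue[t] = revenue[t-1] * (1 + growth[t-1])
# can often be vectorized with a cumulative product of the growth factors
factors = (1 + df['growth']).shift(fill_value=1).cumprod()
df['revenue'] = df['revenue'].iloc[0] * factors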
I ran into this nice blog post: https://towardsdatascience.com/the-search-for-categorical-correlation-a1c.
The author creates a function that allows you to calculate associations between categorical features and then create a heatmap out of it.
The function is given as:
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
I am able to create a list of associations between one feature and the rest by running the function in a for loop.
for item in raw[categorical].columns.tolist():
    value = cramers_v(raw['status_group'], raw[item])
    print(item, value)
It works in the sense that I get a list of association values, but I don't know how I would run this function for all features against each other and turn that into a new dataframe.
The author of this article has written a nice new library that has this feature built in, but it doesn't turn out nicely for my long list of features (my laptop can't handle it).
Running it on the first 100 lines of my df results in this... (note: this is what I get by running the associations function of the dython library written by the author).
How could I run the cramers_v function for all combinations of features and then turn this into a df which I could display in a heatmap?
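One way to do this (a sketch, not from the original thread) is to loop over all unordered pairs of columns, fill a square DataFrame, and hand it to seaborn; it assumes categorical is your list of categorical column names, as in the loop above:

import itertools
import numpy as np
import pandas as pd
import seaborn as sns

cols = raw[categorical].columns.tolist()
assoc = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)

for a, b in itertools.combinations(cols, 2):
    v = cramers_v(raw[a], raw[b])
    assoc.loc[a, b] = v
    assoc.loc[b, a] = v  # Cramér's V is symmetric

sns.heatmap(assoc, annot=True)

Computing only the upper triangle halves the work compared to evaluating every ordered pair, which may also help with the performance problem mentioned above.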
I have many aggregations of the type below in my code:
period = 'ag'
index = ['PA']
lvl = 'pa'

wm = lambda x: np.average(x, weights=dfdom.loc[x.index, 'pop'])

dfpa = dfdom[(dfdom['stratum_kWh'] != 8)].groupby(index).agg(
    pa_mean_ea_ag_kwh=('mean_ea_' + period + '_kwh', wm),
    pa_pop=('dom_pop', 'sum'))
It's straightforward to build the right-hand side of the aggregation dynamically. I also want to build the left-hand side (the output column names) dynamically, so that 'dom', 'ea', 'ag' and 'kw/kwh/thm' can all be supplied as variable inputs depending on which process I'm executing. This would significantly reduce the amount of code and make updates easier to manage; otherwise I need to write separate, otherwise identical code for each combination of the above.
Can I use eval to do this? I'd appreciate guidance on how to do it. Thanks.
Adding code written after feedback from Vaidøtas I.:
index = ['PA']
lvl = 'pa'
fname = lvl + "_pop"

b = f'dfdom.groupby({index}).agg({lvl}_pop = ("dom_pop", "sum"))'
dfpab = exec(b)
The output of the above is a 'NoneType object'. If I lift the text out of variable b and run the code directly, as shown below, I get a dataframe.
dfpab = dfdom.groupby(['PA']).agg(pa_pop = ("dom_pop", "sum"))
(I've simplified my original example to better connect with the second code added.)
Use exec(); eval() is something different (eval() evaluates a single expression and returns its value, while exec() runs arbitrary statements).
For example:

exec(f"variable_name{added_namepart} = variable_value{added_valuepart}")
I'm trying to read data from a csv and then process it in different ways (for starters, just the average).
Data
(OneDrive) https://1drv.ms/u/s!ArLDiUd-U5dtg0teQoKGguBA1qt9?e=6wlpko
The data looks like this:
ID; Property1; Property2; Property3...
1; ....
1; ...
1; ...
2; ...
2; ...
3; ...
...
Every line is a GPS point. All points with the same ID (for example 1) together form one route. The routes are not all the same length, and some IDs are skipped, so the IDs do not increase seamlessly.
I should add that the points are ALWAYS the same number of meters apart from each other, and I don't need the XY information currently.
Wanted Result
In the end I want something like this:
[ID, AVG_Property1, AVG_Property2, ...]
[1, 1.00595, 2.9595, ...]
[2, 1.50606, 1.5959, ...]
What I got so far
import os
import numpy
import pandas as pd

data = pd.read_csv(os.path.join('C:\\data', 'data.csv'), sep=';')

# [id, len, prop1, prop2, ...]
routes = numpy.zeros((data.size, 10))  # 10 properties
sums = numpy.zeros(8)
nr_of_entries = 0
current_id = 1

for index, row in data.iterrows():
    if int(row['id']) != current_id:  # after the last point of the route
        routes[current_id-1][0] = current_id
        routes[current_id-1][1] = nr_of_entries  # how many points are in this route?
        routes[current_id-1][2] = sums[0] / nr_of_entries
        routes[current_id-1][3] = sums[1] / nr_of_entries
        routes[current_id-1][4] = sums[2] / nr_of_entries
        routes[current_id-1][5] = sums[3] / nr_of_entries
        routes[current_id-1][6] = sums[4] / nr_of_entries
        routes[current_id-1][7] = sums[5] / nr_of_entries
        routes[current_id-1][8] = sums[6] / nr_of_entries
        routes[current_id-1][9] = sums[7] / nr_of_entries
        current_id = int(row['id'])
        sums = numpy.zeros(8)
        nr_of_entries = 0
    sums[0] += row[3]
    sums[1] += row[4]
    sums[2] += row[5]
    sums[3] += row[6]
    sums[4] += row[7]
    sums[5] += row[8]
    sums[6] += row[9]
    sums[7] += row[10]
    nr_of_entries = nr_of_entries + 1

routes
My problem
1.) The way I did it, I have to copy and paste the same code for every other processing approach since, as stated, I need to process the data in multiple different ways; the average is just an example.
2.) The reading of the data is clumsy and fails when IDs are missing.
3.) I'm a C# developer, so my approach would be to create a class 'Route' which holds all the points and then provides methods like 'calculate average for prop 1'. That way I could also tweak the data if needed (extreme values, for example). But I have no idea how this would be done in Python, or whether it is a reasonable approach in this language.
4.) Is there a more elegant way to iterate through the original csv, getting route ID 1, then route ID 2 and so on? Maybe something like LINQ queries in C#?
Thanks for any help.
Here is a solution and some ideas you can use. The example shows multiple options for the same issue, so you have to choose which fits your purpose best. Also, it is Python 3.7; you didn't specify a version, so I hope this works.
import pandas as pd

class Route(object):
    """description of class"""

    def __init__(self, id, rawdata):  # on startup
        self.id = id
        self.rawdata = rawdata
        self.avg_Prop1 = self.calculate_average('Prop1')
        self.sum_Prop4 = None

    def calculate_average(self, Prop_Name):  # self reference as first argument in class methods
        return self.rawdata[Prop_Name].mean()

    def give_Prop_data(self, Prop_Name):  # return the Prop data as a list
        return self.rawdata[Prop_Name].tolist()

    def any_function(self, my_function, Prop_Name):  # not sure what dataframes support, so turning it into a list first
        return my_function(self.rawdata[Prop_Name].tolist())
# end of class definition

data = pd.read_csv('testdata.csv', sep=';')
# [id, len, prop1, prop2, ...]

route_list = []  # list of all the objects created from the Route class
for i in data.id.unique():
    print('Current id:', i, ' with ', len(data[data['id'] == i]), 'entries')
    route_list.append(Route(i, data[data['id'] == i]))

# created the Prop1 average in the initialization of Route, so just accessing the attribute
print(route_list[1].avg_Prop1)

for current_route in route_list:
    print('Route ', current_route.id, ' properties:')
    for i in current_route.rawdata.columns[1:]:  # for all except the first (id)
        print(i, ' has average ', current_route.calculate_average(i))  # i is the column name, not just an id

# or pass any function that you want
route_list[1].sum_Prop4 = route_list[1].any_function(sum, 'Prop4')
print(route_list[1].sum_Prop4)
# which is equivalent to
print(sum(route_list[1].rawdata['Prop4']))
To address your individual problems out of order:
For 2.) and 4.): Looping only over the existing IDs (data.id.unique()) solves the problem. I have no idea what LINQ queries are, but I assume they are similar. In general, Python has a great way of looping over objects (like for current_route in route_list), which is worth looking into if you want to use it a little more.
For 1.) and 3.): Again, looping solves the issue. I created a class in the example mostly to show the syntax for classes. The benefits and drawbacks of using classes should be the same in Python as in C#.
As it stands, the class probably isn't great, but that depends on how you want to use it. If the class should just be a practical way of storing and accessing data, it shouldn't have the methods, because you don't need an individual average method for each route; then you can just access its data and use it in a function, as in sum(route_list[1].rawdata['Prop4']). If, however, different calculations are necessary depending on the data (the number of rows, for example), it might come in handy to use the method calculate_average and differentiate in there.
Another example is the use of the attributes. If you need the average for Prop1 every time, creating it at initialization seems like a good idea; otherwise I wouldn't bother always calculating it.
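As a side note, if all you need are the per-route averages from your "Wanted Result", pandas' groupby can produce them directly (a sketch, assuming all property columns are numeric):

averages = data.groupby('id').mean()         # one row per route, properties averaged
averages['len'] = data.groupby('id').size()  # number of points per route
print(averages)

This also handles missing IDs for free, since groupby only visits the IDs that actually occur.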
I hope this helps!
I have a time series dataframe with about 10 columns where I am performing manipulations on the time series to return results of strategy data. I would like to test 2 parameters, as they may or may not affect each other. When tested independently, each run takes over 10 sec per unit (over 6.5 hours for the total run), and I'm looking to speed this up. I have been reading about dask and it seems that it's the right module to use.
My current code iterates over each parameter range with nested loops. I know it can be parallelized, as the data per day is mutually exclusive.
Here is the code:
amount1 = np.arange(.001, .03, .0005)
amount2 = np.arange(.001, .03, .0005)

def getResults(df, amount1, amount2):
    final_results = []
    for x in tqdm(amount1):
        for y in amount2:
            df1 = None
            df1 = function1(df.copy(), x, y)  # takes about 2 sec
            df1 = function2(df1)              # takes about 2 sec
            df1 = function3(df1)              # takes about 3 sec
            final_results.append([x, y, df1['results'].iloc[-1]])
    return final_results
UPDATE:
So it looks like the improvements should come from adjusting the function to remove the iteration from the calls and to create a list of jobs (my understanding). Here is where I am so far. I will probably need to move my df to a dask dataframe, so that the data can be chunked into smaller pieces. The question is: do I leave function1, function2 and function3 as pandas vector manipulations, or do they need to become full dask functions?
def getResults(df, amount):
    df1 = None
    df1 = dsk.delayed(function1)(df, amount[0], amount[1])
    df1 = dsk.delayed(function2)(df1)
    df1 = dsk.delayed(function3)(df1)
    return [amount[0], amount[1], df1['results'].iloc[-1]]

# Create a list of processes from jobs. jobs is a list of tuples that replaces the iteration.
processes = [getResults(df, items) for items in jobs]

# Create a process list of results
results = []
for i in range(len(processes)):
    results.append(processes[i])
You probably want to use either dask.delayed or the concurrent.futures interface.
Something like the following would probably work well (untested; I recommend reading the docs referenced above to understand what it's doing).
import dask

def getResults(df, amount1, amount2):
    final_results = []
    for x in amount1:
        for y in amount2:
            df1 = None
            df1 = dask.delayed(function1)(df.copy(), x, y)
            df1 = dask.delayed(function2)(df1)
            df1 = dask.delayed(function3)(df1)
            final_results.append([x, y, df1['results'].iloc[-1]])
    return final_results

out = getResults(df, amount1, amount2)
result = dask.delayed(out).compute()
Also, I would avoid calling df.copy() if you can. Ideally function1 would not mutate its input data.
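For completeness, the concurrent.futures interface mentioned above could look roughly like this (a sketch, assuming function1 no longer mutates its input and that df and the functions are visible to the worker processes, e.g. defined at module level on a fork-based platform):

from concurrent.futures import ProcessPoolExecutor
from itertools import product

def one_run(params):
    # run one (x, y) combination of the parameter sweep
    x, y = params
    df1 = function1(df, x, y)
    df1 = function2(df1)
    df1 = function3(df1)
    return [x, y, df1['results'].iloc[-1]]

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        final_results = list(pool.map(one_run, product(amount1, amount2)))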
I have the following problem: I have two sets of data (set T and set F), and the following functions:
x(T) = arctan(T - c0)
A(x(T)) = arctan(x(T) - c1)
B(x(T)) = arctan(x(T) - c2)
Y(x(T), F) = (A(x(T)) - B(x(T)))/2 - A(x(T))*arctan(F - c3) + B(x(T))*arctan(F - c4)
# where c0, c1, c2, c3, c4 are constants
Now I want to create a surface plot of Y. For that I would like to implement Y as a Python (numpy) function, which turns out to be quite complicated because Y takes other functions as input.
Another idea of mine was to evaluate x, A and B on the data separately and store the results in numpy arrays. With those I could also get the output of Y, but I don't know which way is better for plotting the data, and I really would like to know how to write Y as a Python function.
Thank you very much for your help
It is absolutely possible to use functions as input parameters to other functions. A use case could look like:
def plus_one(standard_input_parameter_like_int):
    return standard_input_parameter_like_int + 1

def apply_function(function_as_input, standard_input_parameter):
    return function_as_input(standard_input_parameter)

if __name__ == '__main__':
    print(apply_function(plus_one, 1))
I hope that helps to solve your specific problem.
[...] something like def s(x,y,z,*args,*args2): will yield an error.

This is perfectly normal: only one variable-length non-keyword argument list is allowed per function (it doesn't have to be named args, but it can appear only once, marked with a single asterisk). So if you remove the asterisk (*) from args2 you should actually be able to define s properly.
Regarding your initial question you could do something like:
import numpy as np

c = [0.2, -0.2, 0, 0, 0, 0]

def x(T):
    return np.arctan(T - c[0])

def A(xfunc, T):
    return np.arctan(xfunc(T) - c[1])

def B(xfunc, T):
    return np.arctan(xfunc(T) - c[2])

def Y(xfunc, Afunc, Bfunc, t, f):
    return ((Afunc(xfunc, t) - Bfunc(xfunc, t)) / 2.0
            - Afunc(xfunc, t) * np.arctan(f - c[3])
            + Bfunc(xfunc, t) * np.arctan(f - c[4]))

_tSet = np.linspace(-1, 1, 20)
_fSet = np.linspace(-1, 1, 20)

print(Y(x, A, B, _tSet, _fSet))
As you can see (and as you have probably already tested yourself, judging from your comment), you can use functions as arguments. And as long as you don't use any if conditions or other non-vectorized operations in your 'sub'-functions, the top-level function is effectively vectorized already.
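For the surface plot itself (not covered above, but part of your question): a sketch of evaluating Y on a 2-D grid of (T, F) pairs with np.meshgrid rather than on two 1-D arrays:

import numpy as np
import matplotlib.pyplot as plt

# build a 2-D grid so every T is combined with every F
Tg, Fg = np.meshgrid(np.linspace(-1, 1, 20), np.linspace(-1, 1, 20))
Z = Y(x, A, B, Tg, Fg)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # this spelling needs matplotlib >= 3.2
ax.plot_surface(Tg, Fg, Z)
ax.set_xlabel('T')
ax.set_ylabel('F')
plt.show()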