Dynamic creation of fieldname in pandas groupby-aggregate - python

I have many aggregations of the type below in my code:
period = 'ag'
index = ['PA']
lvl = 'pa'
wm = lambda x: np.average(x, weights=dfdom.loc[x.index, 'pop'])
dfpa = dfdom[dfdom['stratum_kWh'] != 8].groupby(index).agg(
    pa_mean_ea_ag_kwh=('mean_ea_' + period + '_kwh', wm),
    pa_pop=('dom_pop', 'sum'))
It's straightforward to build the right-hand side of the aggregation. I want to also dynamically build the left-hand side, so that 'dom', 'ea', 'ag' and 'kw/kwh/thm' can all be supplied as variable inputs, depending on which process I'm executing. This would significantly reduce the amount of code that needs to be written, and updates would be easier to manage; otherwise I need to write separate, otherwise identical code for each combination of the above.
Can I use eval to do this? I'd appreciate guidance on how to do it. Thanks.
Adding code written after feedback from Vaidøtas I.:
index = ['PA']
lvl = 'pa'
fname = lvl+"_pop"
b = f'dfdom.groupby({index}).agg({lvl}_pop = ("dom_pop", "sum"))'
dfpab = exec(b)
The output for the above is a 'NoneType object'. If I lift the text in variable b and directly run the code as shown below, I get a dataframe.
dfpab = dfdom.groupby(['PA']).agg(pa_pop = ("dom_pop", "sum"))
(I've simplified my original example to better connect with the second code added.)
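For reference, the left-hand side of a named aggregation is just a keyword argument, so it can also be built dynamically with plain dictionary unpacking, without eval or exec. A minimal sketch reusing the names from the code above:

# Build the output column names as strings, then unpack them as keyword arguments.
named_aggs = {
    f'{lvl}_mean_ea_{period}_kwh': ('mean_ea_' + period + '_kwh', wm),
    f'{lvl}_pop': ('dom_pop', 'sum'),
}
dfpa = dfdom[dfdom['stratum_kWh'] != 8].groupby(index).agg(**named_aggs)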

Use exec(); eval() is something different.
For example:
exec(f"variable_name{added_namepart} = variable_value{added_valuepart}")

Related

create new dataframe using function

I ran into this nice blog post: https://towardsdatascience.com/the-search-for-categorical-correlation-a1c.
The author creates a function that allows you to calculate associations between categorical features and then create a heatmap out of it.
The function is given as:
# imports needed to run the function
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
I am able to create a list of associations between one feature and the rest by running the function in a for loop.
for item in raw[categorical].columns.tolist():
    value = cramers_v(raw['status_group'], raw[item])
    print(item, value)
It works in the sense that I get a list of association values, but I don't know how I would run this function for all features against each other and turn that into a new dataframe.
The author of this article has written a nice new library that has this feature built in, but it doesn't turn out nicely for my long list of features (my laptop can't handle it).
Running it on the first 100 lines of my df results in this... (note: this is what I get by running the associations function of the dython library written by the author).
How could I run the cramers_v function for all combinations of features and then turn this into a df which I could display in a heatmap?
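A sketch of one way to do this (reusing raw, categorical and cramers_v from the question): fill a square DataFrame pairwise, then plot it.

import itertools
import seaborn as sns

cols = raw[categorical].columns
# Start from the identity matrix: a feature's association with itself is 1.
assoc = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
for a, b in itertools.combinations(cols, 2):
    v = cramers_v(raw[a], raw[b])
    assoc.loc[a, b] = assoc.loc[b, a] = v  # Cramer's V is symmetric
sns.heatmap(assoc, annot=True)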

Anova Analysis Python_Urgent

I hope I can be as clear as possible.
I have an Excel file with 400 subjects for a study, and for each of them I have their age, their sex, and 40 more columns of biological variables.
E.g.: CODE0001; (age) 20; M/F; BioValue1; BioValue2; ...; BioValue40.
My goal is to analyze these data with a one-way ANOVA, because I think it's the best option I have. I'm trying to do it (even using this guide https://www.marsja.se/four-ways-to-conduct-one-way-anovas-using-python/), but there's always a problem with the code.
So: how can I set up my data in order to be able to use the code for example from that website?
I've already done Dataset.mean() and Dataset.std() for all the data, but I can't use, for example, the value "Mean Age", because it seems like Jupyter only reads it as a string and not as a number.
I'm in a deep state of confusion, so all kind of help will be super appreciated!!!
Thank you in advance
I'm sorry, but I didn't understand. I'm relatively new to Python, so maybe I couldn't explain myself properly.
I need to do an ANOVA analysis:
1) First I did this:
AnalisiISAD.mean()
2) Then I made a list from that:
MeanList = [......]
3) Then I proceeded with the ANOVA script:
AnalisiI.boxplot('MeanList', by='AgeT0', figsize=(12, 8))
ctrl = Analisi['MeanList'][Analisi == 'ctrl']
grps = pd.unique(Analisi.group.values)
d_data = {grp: Analisi['MeanList'][Analisi.group == grp] for grp in grps}
k = len(pd.unique(Analisi.group))
N = len(Analisi.values)
n = Analisi.groupby('AgeT0').size()[0]
but this error occurs: KeyError: 'Column not found: MeanList'
Does this mean I have to create a new column in the Excel file? How do I do that?
When using df.mean() or df.std(), try converting the data to a pd.Series first and then running it.
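To make that concrete, a minimal sketch (the file name and the column names 'BioValue1' and 'AgeT0' are assumptions, not from the question): coerce a biomarker column to numeric so it is no longer read as a string, then run a one-way ANOVA across the age groups with scipy.

import pandas as pd
from scipy import stats

df = pd.read_excel('study.xlsx')  # hypothetical file: age, sex, BioValue1..BioValue40

# Coerce string-typed numbers to floats; non-numeric entries become NaN.
df['BioValue1'] = pd.to_numeric(df['BioValue1'], errors='coerce')

# One sample per group, here grouped by the assumed 'AgeT0' column.
groups = [g['BioValue1'].dropna() for _, g in df.groupby('AgeT0')]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)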

Python: Data-structure and processing of GPS points and properties

I'm trying to read data from a CSV and then process it in different ways. (For starters, just the average.)
Data
(OneDrive) https://1drv.ms/u/s!ArLDiUd-U5dtg0teQoKGguBA1qt9?e=6wlpko
The data looks like this:
ID; Property1; Property2; Property3...
1; ....
1; ...
1; ...
2; ...
2; ...
3; ...
...
Every line is a GPS point. All points with the same ID (for example 1) together make up one route. The routes are not all the same length, and some IDs are skipped, so the IDs are not a contiguous sequence of numbers.
I should add that the points are always the same number of meters apart from each other, and I don't need the XY information currently.
Wanted Result
In the end I want something like this:
[ID, AVG_Property1, AVG_Property2, ...]
[1, 1.00595, 2.9595, ...]
[2, 1.50606, 1.5959, ...]
What I got so far
import os
import numpy
import pandas as pd

data = pd.read_csv(os.path.join('C:\\data', 'data.csv'), sep=';')
# [id, len, prop1, prop2, ...]
routes = numpy.zeros((data.size, 10))  # 10 properties
sums = numpy.zeros(8)
nr_of_entries = 0
current_id = 1
for index, row in data.iterrows():
    if int(row['id']) != current_id:  # after the last point of the route
        routes[current_id - 1][0] = current_id
        routes[current_id - 1][1] = nr_of_entries  # how many points are in this route?
        routes[current_id - 1][2] = sums[0] / nr_of_entries
        routes[current_id - 1][3] = sums[1] / nr_of_entries
        routes[current_id - 1][4] = sums[2] / nr_of_entries
        routes[current_id - 1][5] = sums[3] / nr_of_entries
        routes[current_id - 1][6] = sums[4] / nr_of_entries
        routes[current_id - 1][7] = sums[5] / nr_of_entries
        routes[current_id - 1][8] = sums[6] / nr_of_entries
        routes[current_id - 1][9] = sums[7] / nr_of_entries
        current_id = int(row['id'])
        sums = numpy.zeros(8)
        nr_of_entries = 0
    sums[0] += row[3]
    sums[1] += row[4]
    sums[2] += row[5]
    sums[3] += row[6]
    sums[4] += row[7]
    sums[5] += row[8]
    sums[6] += row[9]
    sums[7] += row[10]
    nr_of_entries = nr_of_entries + 1
routes
My problem
1.) The way I did it, I have to copy-paste the same code for every other processing approach, since, as stated, I need to do several different kinds of processing; average is just an example.
2.) The reading of the data is clumsy and fails when IDs are missing.
3.) I'm a C# developer, so my approach would be to create a class 'Route' which holds all the points and then provides methods like 'calculate average for prop 1'. This way I could also tweak the data if needed (extreme values, for example). But I have no idea how this would be done in Python, and whether this is a reasonable approach in this language.
4.) Is there a more elegant way to iterate through the original CSV, getting Route ID 1, then Route ID 2 and so on? Maybe something like LINQ queries in C#?
Thanks for any help.
Here is a solution and some ideas you can use. The example features multiple options for the same issue, so you have to choose which fits your purpose best. Also, it is Python 3.7; you didn't specify a version, so I hope this works.
import pandas as pd

class Route(object):
    """description of class"""

    def __init__(self, id, rawdata):  # on startup
        self.id = id
        self.rawdata = rawdata
        self.avg_Prop1 = self.calculate_average('Prop1')
        self.sum_Prop4 = None

    def calculate_average(self, Prop_Name):  # self-reference for first argument in a class method
        return self.rawdata[Prop_Name].mean()

    def give_Prop_data(self, Prop_Name):  # return the Prop data as a list
        return self.rawdata[Prop_Name].tolist()

    def any_function(self, my_function, Prop_Name):  # not sure what dataframes support, so turn it into a list first
        return my_function(self.rawdata[Prop_Name].tolist())
# end of class definition

data = pd.read_csv('testdata.csv', sep=';')
# [id, len, prop1, prop2, ...]
route_list = []  # list of all the objects created from the Route class
for i in data.id.unique():
    print('Current id:', i, ' with ', len(data[data['id'] == i]), 'entries')
    route_list.append(Route(i, data[data['id'] == i]))

# the Prop1 average was created in the initialization of Route, so just access the attribute
print(route_list[1].avg_Prop1)

for current_route in route_list:
    print('Route ', current_route.id, ' Properties :')
    for i in current_route.rawdata.columns[1:]:  # for all except the first (id)
        print(i, ' has average ', current_route.calculate_average(i))  # i is the column name string, not just an id

# or pass any function that you want
route_list[1].sum_Prop4 = route_list[1].any_function(sum, 'Prop4')
print(route_list[1].sum_Prop4)
# which is equivalent to
print(sum(route_list[1].rawdata['Prop4']))
To address your individual problems, out of order:
For 2. and 4.) Looping only over the existing IDs (data.id.unique()) solves the problem. I have no idea what LINQ queries are, but I assume they are similar. In general, Python has a great way of looping over objects (like for current_route in route_list), which is worth looking into if you want to use it a little more.
For 1. and 3.) Again, looping solves the issue. I created a class in the example mostly to show the syntax for classes. The benefits and drawbacks of using classes should be the same in Python as in C#.
As it is right now, the class probably isn't great, but this depends on how you want to use it. If the class should just be a practical way of storing and accessing data, it shouldn't have the methods, because you don't need an individual average method for each route. Then you can just access its data and use it in a function, as in sum(route_list[1].rawdata['Prop4']). If, however, different calculations are necessary depending on the data (the number of rows, for example), it might come in handy to use the method calculate_average and differentiate in there.
Another example is the use of the attributes. If you need the average for Prop1 every time, creating it at initialization seems a good idea; otherwise I wouldn't bother always calculating it.
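As a side note (not part of the class-based example above): if all you need in the end is the per-route average of every property, pandas' own groupby produces the wanted [ID, AVG_Property1, ...] table directly.

# One output row per route, averaging every numeric column; missing IDs simply don't appear.
averages = data.groupby('id').mean().reset_index()
print(averages)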
I hope this helps!

Set query parameters of RFC_READ_TABLE using win32com module?

I'm trying to port to Python a SAP table download script that already works in Excel VBA, but I want a command-line version, and I would prefer to avoid VBScript for a number of reasons that go beyond the goal of this post.
I'm stuck at the point where I need to fill in the values of a table:
from win32com.client import Dispatch

Functions = Dispatch("SAP.Functions")
Functions.Connection.Client = "400"
Functions.Connection.ApplicationServer = "myserver"
Functions.Connection.Language = "EN"
Functions.Connection.User = "myuser"
Functions.Connection.Password = "mypwd"
Functions.Connection.SystemNumber = "00"
Functions.Connection.UseSAPLogonIni = False

if Functions.Connection.Logon(0, True) == True:
    print("Logon OK")

RFC = Functions.Add("RFC_READ_TABLE")
RFC.exports("QUERY_TABLE").Value = "USR02"
RFC.exports("DELIMITER").Value = "~"
# RFC.exports("ROWSKIPS").Value = 2000
# RFC.exports("ROWCOUNT").Value = 10
tblOptions = RFC.Tables("OPTIONS")

# returned data
tblData = RFC.Tables("DATA")
tblFields = RFC.Tables("FIELDS")
tblFields.AppendRow()
print(tblFields.RowCount)
print(tblFields(1, "FIELDNAME"))
# the 2 lines above print 1 and an empty string, so the row in the table exists
Up to here it is basically copied from VBA, adapting the syntax.
In VBA at this point I'm able to do
tblFields(1,"FIELDNAME") = "BNAME"
If I do the same here, I get an error, because the left part is a function call and, written that way, it returns a string. In VBA it is probably a two-dimensional array.
I unsuccessfully tried various approaches like:
tblFields.setValue([{"FIELDNAME": "BNAME"}])
tblFields(1, "FIELDNAME").Value = "BNAME"
tblFields(1, "FIELDNAME").setValue("BNAME")
tblFields.FieldName = "BNAME"  # kinda desperate
The script works, without setting the FIELDS table, for outputs that produce rows shorter than 500 chars. This is a SAP limit in the function.
I know that this is not the best way, but I can't use the SAPNWRFC library and I can't use librfc32.dll.
I must be able to solve it this way, or revert to the VB version.
Thanks to anyone who will provide a hint
After a lot of trial and error, I found a solution.
Instead of adding row by row to the "OPTIONS" or "FIELDS" tables, you can just submit a prefilled table.
This should work:
tblFields.Data = (('VBELN', '000000', '000000', '', ''),
('POSNR', '000000', '000000', '', ''))
same here:
tblOptions.Data = (("VBELN EQ '2557788'",),)
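For completeness, a sketch of invoking the function and reading the result back; this mirrors the classic VBA pattern, so Call() and the DATA table's WA field are assumed to behave the same way through win32com:

# Invoke the RFC and split each returned row on the delimiter set earlier ("~").
if RFC.Call() == True:
    for i in range(1, tblData.RowCount + 1):
        print(tblData(i, "WA").split("~"))  # WA holds one flattened result row
else:
    print(RFC.Exception)  # error text, as in the VBA API (assumed)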

Financial modelling with Pandas dataframe

I have built up a simple DCF model, mainly through Pandas. Basically all calculations happen in a single dataframe. I want to find a better coding style as the model becomes more complex and more variables are added. The following example may illustrate my current coding style - simple and straightforward.
# some customized formulas
def GrowthRate():
    ...

def BoundedVal():
    ...

# some operations
df['EBIT'] = df['revenue'] - df['costs']
df['NI'] = df['EBIT'] - df['tax'] - df['interests']
df['margin'] = df['NI'] / df['revenue']
I loop through all years to calculate values. Now I have added over 500 variables to the model, and the calculation has also become more complex. I was thinking of creating a separate def for each variable and updating the main df accordingly. So the above code would become:
def EBIT(t):
    df['EBIT'][t] = df['revenue'][t] - df['costs'][t]
    # ....some more ops
    return df['EBIT'][t]

def NI(t):
    df['NI'][t] = EBIT(t) - df['tax'][t] - df['interests'][t]
    # ....some more ops
    return df['NI'][t]

def margin(t):
    if check_df_is_nan():
        df['margin'][t] = NI(t) - df['costs'][t]
        # ....some more ops
        return df['margin'][t]
    else:
        return df['margin'][t]
Each function is able to 1) calculate results and update the df, and 2) return the value if called by other functions.
To avoid redundant calculation (think of margin(t) being called multiple times), it would be better to add a "check if the value has been calculated before" step to each def.
My questions: 1) Is it possible to add that if statement to a whole group of defs, similar to the if clause above?
2) I have over 50 custom defs, so the main file becomes too long. I cannot simply move all defs to another file and import them all, because some defs also refer to the dataframe in the main file. Any suggestions? Can I set the df as a global variable, so that defs from other files are able to modify and update it?
For 1, just check if the value is NaN or not.
import pandas as pd

def EBIT(t):
    if pd.notnull(df['EBIT'][t]):
        return df['EBIT'][t]
    df['EBIT'][t] = df['revenue'][t] - df['costs'][t]
    ...
For 2, using a global variable might work, but it's a bad approach; you should try to avoid globals whenever possible.
What you should do instead is make each function take the data frame as an argument. Then you can pass in whichever data frame you want to operate on.
# in some other file, e.g. operations.py
def EBIT(df, t):
    # logic goes here
    ...

# in the main file
import operations as op
# ...
op.EBIT(df, t)
P.S. Have you considered doing the operations on whole columns at once rather than using t? It should be much faster.
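On question 1, the repeated "has this been calculated before?" check can also be factored out once and shared by all the defs with a decorator. A sketch, not from the answer above, and following its advice of passing df explicitly:

import functools
import pandas as pd

def cached_column(col):
    """Skip recomputation when df already holds a value at (t, col)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, t):
            if pd.notnull(df.loc[t, col]):  # already computed earlier
                return df.loc[t, col]
            df.loc[t, col] = func(df, t)  # compute once and store in the frame
            return df.loc[t, col]
        return wrapper
    return decorator

@cached_column('EBIT')
def EBIT(df, t):
    return df.loc[t, 'revenue'] - df.loc[t, 'costs']

And on the P.S.: the whole-column version of the same calculation is simply df['EBIT'] = df['revenue'] - df['costs'], computed for all years at once.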
