create new dataframe using function - python

I ran into this nice blog post: https://towardsdatascience.com/the-search-for-categorical-correlation-a1c.
The author creates a function that allows you to calculate associations between categorical features and then create a heatmap out of it.
The function is given as:
import numpy as np
import pandas as pd
import scipy.stats as ss  # imports the function relies on

def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
I am able to create a list of associations between one feature and the rest by running the function in a for loop.
for item in raw[categorical].columns.tolist():
    value = cramers_v(raw['status_group'], raw[item])
    print(item, value)
It works in the sense that I get a list of association values, but I don't know how I would run this function for all features against each other and turn that into a new dataframe.
The author of this article has written a nice new library that has this feature built in, but it doesn't turn out nicely for my long list of features (my laptop can't handle it).
Running it on the first 100 lines of my df results in this... (note: this is what I get by running the associations function of the dython library written by the author).
How could I run the cramers_v function for all combinations of features and then turn this into a df which I could display in a heatmap?
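For reference, one straightforward way to do this (a sketch assuming the cramers_v function and the raw / categorical names above, not the author's dython approach) is a nested loop over the categorical columns that fills a square DataFrame, which seaborn can then plot as a heatmap:
import pandas as pd
import seaborn as sns

cols = raw[categorical].columns.tolist()
assoc = pd.DataFrame(index=cols, columns=cols, dtype=float)
for a in cols:
    for b in cols:
        # cramers_v is symmetric, so roughly half of these calls are redundant;
        # caching the upper triangle would halve the work for a long column list
        assoc.loc[a, b] = cramers_v(raw[a], raw[b])
sns.heatmap(assoc, annot=True)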

Related

Calculate Krippendorff Alpha for Multi-label Annotation

How can I calculate Krippendorff Alpha for multi-label annotations?
In the case of multi-class annotation (assuming that 3 coders have annotated 4 texts with 3 labels: a, b, c), I first construct the reliability data matrix, then the coincidence matrix, and based on the coincidences I can calculate Alpha:
The question is how I can prepare the coincidences and calculate Alpha in the case of a multi-label classification problem like the following?
Python implementation or even excel would be appreciated.
I came across your question while looking for similar information. We used the code below, with nltk.agreement for the metrics and pandas_ods_reader to read the data from a LibreOffice spreadsheet. Our data has two annotators, and some items can have two labels (for instance, one coder annotated a single label while the other coder annotated two).
The spreadsheet screencap below shows the structure of the input data. The column for annotation items is called annotItems, and annotation columns are called coder1 and coder2. The separator when there's more than one label is a pipe, unlike the comma in your example.
The code is inspired by this SO post: Low alpha for NLTK agreement using MASI distance
[Spreadsheet screencap]
from nltk import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance
import pandas_ods_reader as pdreader

annotfile = "test-iaa-so.ods"
df = pdreader.read_ods(annotfile, "Sheet1")

annots = []

def create_annot(an):
    """
    Create frozensets with the unique label
    or with both labels splitting on pipe.
    Unique label has to go in a list so that
    frozenset does not split it into characters.
    """
    if "|" in str(an):
        an = frozenset(an.split("|"))
    else:
        # single label has to go in a list
        # need to cast or not depends on your data
        an = frozenset([str(int(an))])
    return an

for idx, row in df.iterrows():
    annot_id = row.annotItem + str.zfill(str(idx), 3)
    annot_coder1 = ['coder1', annot_id, create_annot(row.coder1)]
    annot_coder2 = ['coder2', annot_id, create_annot(row.coder2)]
    annots.append(annot_coder1)
    annots.append(annot_coder2)

# based on https://stackoverflow.com/questions/45741934/
jaccard_task = agreement.AnnotationTask(distance=jaccard_distance)
masi_task = agreement.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]

for task in tasks:
    task.load_array(annots)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
For the data in the screencap linked from this answer, this would print:
Statistics for dataset using <function jaccard_distance at 0x7fa1464b6050>
C: {'coder1', 'coder2'}
I: {'item3002', 'item1000', 'item6005', 'item5004', 'item2001', 'item4003'}
K: {frozenset({'1'}), frozenset({'0'}), frozenset({'0', '1'})}
Pi: 0.1818181818181818
Kappa: 0.35714285714285715
Multi-Kappa: 0.35714285714285715
Alpha: 0.02941176470588236
Statistics for dataset using <function masi_distance at 0x7fa1464b60e0>
C: {'coder1', 'coder2'}
I: {'item3002', 'item1000', 'item6005', 'item5004', 'item2001', 'item4003'}
K: {frozenset({'1'}), frozenset({'0'}), frozenset({'0', '1'})}
Pi: 0.09181818181818181
Kappa: 0.2864285714285714
Multi-Kappa: 0.2864285714285714
Alpha: 0.017962466487935425

Dynamic creation of fieldname in pandas groupby-aggregate

I have many aggregations of the type below in my code:
period = 'ag'
index = ['PA']
lvl = 'pa'
wm = lambda x: np.average(x, weights=dfdom.loc[x.index, 'pop'])

dfpa = dfdom[(dfdom['stratum_kWh'] != 8)].groupby(index).agg(
    pa_mean_ea_ag_kwh=('mean_ea_' + period + '_kwh', wm),
    pa_pop=('dom_pop', 'sum'))
It's straightforward to build the right-hand side of the aggregation equation. I also want to build the left-hand side dynamically, so that 'dom', 'ea', 'ag' and 'kw/kwh/thm' can all be supplied as variable inputs and used depending on which process I'm executing. This would significantly reduce the amount of code that needs to be written, and updates would also be easier to manage, as otherwise I need to write separate but otherwise identical code for each combination of the above.
Can I use eval to do this? I'd appreciate guidance on how to do it. Thanks.
Adding code written after feedback from Vaidøtas I.:
index = ['PA']
lvl = 'pa'
fname = lvl+"_pop"
b = f'dfdom.groupby({index}).agg({lvl}_pop = ("dom_pop", "sum"))'
dfpab = exec(b)
The output for the above is a 'NoneType object'. If I lift the text in variable b and directly run the code as shown below, I get a dataframe.
dfpab = dfdom.groupby(['PA']).agg(pa_pop = ("dom_pop", "sum"))
(I've simplified my original example to better connect with the second code added.)
Use exec(); eval() is something different.
For example:
exec(f"variable_name{added_namepart} = variable_value{added_valuepart}")

Issue with Python list interaction with for loop

I am having a problem with the genetic feature optimization algorithm that I am attempting to build. The idea is that a specific combination of features is tested and, if the model accuracy using those features is higher than the previous maximum, the combination replaces the previous maximum combination. By running through the remaining potential features in this way, the final combination should be the optimal combination of features for the given feature vector size. Currently, the code that attempts to achieve this looks like:
def mutate_features(features, feature):
    new_features = features
    index = random.randint(0, len(features) - 1)
    new_features[index] = feature
    return new_features

def run_series(n, f_list, df):
    features_list = []
    results_list = []
    max_results_list = [[0, 0, 0, 0, 0]]
    max_feature_list = []
    features = [0, 0, 0, 0, 1]
    for i in range(0, 5):  # 5 has just been chosen as the range for testing purposes
        results = run_algorithm(df, f_list, features)
        features_list.append(features)
        results_list.append(results)
        if check_result_vector(max_results_list, results):
            max_results_list.append(results)
            max_feature_list.append(features)
        else:
            print("Revert to previous :" + str(max_feature_list[-1]))
            features = max_feature_list[-1]
        features = mutate_features(features, f_list[i])
    print("Feature List = " + str(features_list))
    print("Results List = " + str(results_list))
    print("Max Results List = " + str(max_results_list))
    print("Max Feature List = " + str(max_feature_list))
The output from this code is included in the screencap below:
[Output screencap]
The part that I do not understand is the output of max_feature_list and features_list.
If anything is appended to max_feature_list or features_list inside the for loop, every item already in the list seems to change to match the latest addition. I may not fully understand the syntax/logic around this and would really appreciate any feedback on why the program is doing this.
It happens because you change the values of features inside the mutate_features function; since append stores a reference to the list, the entries already in max_feature_list change too when the underlying list changes.
One way to prevent this behaviour is to deepcopy features inside mutate_features, mutate the copy as you want, and then return it.
For example:
import random
from copy import deepcopy

def mutate_features(features, feature):
    new_features = deepcopy(features)
    index = random.randint(0, len(features) - 1)
    new_features[index] = feature
    return new_features

features = [1, 2, 3]
res = []
res.append(features)
features = mutate_features(features, 4)  # 4 is just an example replacement value
res.append(features)
print(res)  # prints something like [[1, 2, 3], [1, 2, 4]], depending on the random index

parallel processing - nearest neighbour search using pysal python?

I have this data frame df1,
id lat_long
400743 2504043 (175.0976323, -41.1141412)
43203 1533418 (173.976683, -35.2235338)
463952 3805508 (174.6947496, -36.7437555)
1054906 3144009 (168.0105269, -46.36193)
214474 3030933 (174.6311167, -36.867717)
1008802 2814248 (169.3183615, -45.1859095)
988706 3245376 (171.2338968, -44.3884099)
492345 3085310 (174.740957, -36.8893026)
416106 3794301 (174.0106383, -35.3876921)
937313 3114127 (174.8436185, -37.80499)
I have constructed the tree for search here,
def construct_geopoints(s):
    data_geopoints = [tuple(x) for x in s[['longitude', 'latitude']].to_records(index=False)]
    tree = KDTree(data_geopoints, distance_metric='Arc', radius=pysal.cg.RADIUS_EARTH_KM)
    return tree

tree = construct_geopoints(actualdata)
Now, I am trying to find, for every geopoint in my data frame df1, all the geopoints within 1 km of it. Here is how I am doing it:
dfs = []
for name, group in df1.groupby(np.arange(len(df1)) // 10000):
    s = group.reset_index(drop=True).copy()
    pts = list(s['lat_long'])
    neighbours = tree.query_ball_point(pts, 1)
    s['neighbours'] = pd.Series(neighbours)
    dfs.append(s)
output = pd.concat(dfs, axis=0)
Everything here works fine; however, I am trying to parallelise this task, since df1 has 2M records and this process runs for more than 8 hours. Can anyone help me with this? Another issue: the result returned by query_ball_point is a list, so it throws a memory error when I process it for this huge number of records. Is there any way to handle this?
EDIT :- Memory issue, look at the VIRT size.
It should be possible to parallelize your last segment of code with something like this:
from multiprocessing import Pool
...

def process_group(group):
    s = group[1].reset_index(drop=True)  # .copy() is implicit
    pts = list(s['lat_long'])
    neighbours = tree.query_ball_point(pts, 1)
    s['neighbours'] = pd.Series(neighbours)
    return s

groups = df1.groupby(np.arange(len(df1)) // 10000)
p = Pool(5)
dfs = p.map(process_group, groups)
output = pd.concat(dfs, axis=0)
But watch out, because the multiprocessing library pickles all the data on its way to and from the workers, and that can add a lot of overhead for data-intensive tasks, possibly cancelling the savings due to parallel processing.
I can't see where you'd be getting out-of-memory errors from. 8 million records is not that much for pandas. Maybe if your searches are producing hundreds of matches per row that could be a problem. If you say more about that I might be able to give some more advice.
It also sounds like pysal may be taking longer than necessary to do this. You might be able to get better performance by using GeoPandas or by "rolling your own" solution like this (a rough sketch follows the list):
assign each point to a surrounding 1-km grid cell (e.g., calculate UTM coordinates x and y, then create columns cx=x//1000 and cy=y//1000);
create an index on the grid cell coordinates cx and cy (e.g., df=df.set_index(['cx', 'cy']));
for each point, find the points in the 9 surrounding cells; you can select these directly from the index via df.loc[[(cx-1,cy-1),(cx-1,cy),(cx-1,cy+1),(cx,cy-1),...(cx+1,cy+1)], :];
filter the points you just selected to find the ones within 1 km.
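A minimal sketch of that grid-cell approach, assuming the points have already been projected to metric x/y columns (the column names, the 1000 m cell size, and the plain-pandas lookup are illustrative assumptions, not part of the answer above):
import numpy as np
import pandas as pd

# df is assumed to have projected coordinates in metres in columns 'x' and 'y'
df['cx'] = (df['x'] // 1000).astype(int)
df['cy'] = (df['y'] // 1000).astype(int)
cell_groups = df.groupby(['cx', 'cy']).groups  # maps (cx, cy) -> row labels

def neighbours_within_1km(row):
    # gather candidates from the 9 surrounding cells, then filter by true distance
    candidates = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            candidates.extend(cell_groups.get((row.cx + dx, row.cy + dy), []))
    cand = df.loc[candidates]
    dist = np.hypot(cand['x'] - row.x, cand['y'] - row.y)
    return cand[dist <= 1000].index.tolist()  # includes the point itself, like query_ball_point

df['neighbours'] = [neighbours_within_1km(row) for row in df.itertuples()]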

Financial modelling with Pandas dataframe

I have built up a simple DCF model mainly through Pandas. Basically all calculations happen in a single dataframe. I want to find a better coding style as the model becomes more complex and more variables have been added to the model. The following example may illustrate my current coding style - simple and straightforward.
# some customized formulas
def GrowthRate():
    ....
def BoundedVal():
    ....
# some operations
df['EBIT'] = df['revenue'] - df['costs']
df['NI'] = df['EBIT'] - df['tax'] - df['interests']
df['margin'] = df['NI'] / df['revenue']
I loop through all years to calculate values. Now I have added over 500 variables to the model and the calculation has also become more complex. I was thinking of creating a separate def for each variable and updating the main df accordingly. So the above code would become:
def EBIT(t):
    df['EBIT'][t] = df['revenue'][t] - df['costs'][t]
    #....some more ops
    return df['EBIT'][t]

def NI(t):
    df['NI'][t] = EBIT(t) - df['tax'][t] - df['interests'][t]
    #....some more ops
    return df['NI'][t]

def margin(t):
    if check_df_is_nan():
        df['margin'][t] = NI(t) / df['revenue'][t]
        #....some more ops
        return df['margin'][t]
    else:
        return df['margin'][t]
Each function is able to 1) calculate results and update df, and 2) return the value if called by other functions.
To avoid redundant calculation (think of margin(t) being called multiple times), it would be better to add a "check if the value has already been calculated" step to each def.
My questions: 1) Is it possible to add the if statement to a group of defs, similar to the if clause above?
2) I have over 50 custom defs, so the main file has become too long. I cannot simply move all the defs to another file and import them, because some defs also refer to the dataframe in the main file. Any suggestions? Can I set the df as a global variable so that defs from other files are able to modify and update it?
For 1, just check if the value is NaN or not.
import pandas as pd

def EBIT(t):
    if pd.notnull(df['EBIT'][t]):
        return df['EBIT'][t]
    df['EBIT'][t] = df['revenue'][t] - df['costs'][t]
    ...
For 2, using a global variable might work, but it's a bad approach; you should avoid globals whenever possible.
What you should do instead is make each function take the data frame as an argument. Then you can pass in whichever data frame you want to operate on.
# in some other file
def EBIT(df, t):
# logic goes here
# in the main file
import operations as op
# ...
op.EBIT(df, t)
P.S. Have you considered doing the operation on the whole column at once rather than using t? It should be much faster.
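For instance, a rough sketch of that column-wise idea combined with the pass-the-dataframe pattern above (the vectorised formula and the missing-value mask are assumptions, not code from the answer):
import pandas as pd

def EBIT(df):
    # compute the whole column at once, but only where it is still missing
    missing = df['EBIT'].isnull()
    df.loc[missing, 'EBIT'] = df.loc[missing, 'revenue'] - df.loc[missing, 'costs']
    return df['EBIT']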
