Calculate Krippendorff Alpha for Multi-label Annotation - python

How can I calculate Krippendorff Alpha for multi-label annotations?
In the multi-class case (say, 3 coders annotating 4 texts with one of 3 labels: a, b, c), I first construct the reliability data matrix, then the coincidence matrix, and from the coincidences I can calculate alpha.
The question is how to prepare the coincidences and calculate alpha for a multi-label classification problem like the following case.
A Python implementation, or even Excel, would be appreciated.
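For reference, the single-label (multi-class) case is straightforward with nltk.agreement; here is a minimal sketch for the 3-coder/4-text setup above, with made-up labels:
from nltk import agreement

# (coder, item, label) triples; the label values are illustrative only
data = [
    ('c1', 't1', 'a'), ('c2', 't1', 'a'), ('c3', 't1', 'b'),
    ('c1', 't2', 'b'), ('c2', 't2', 'b'), ('c3', 't2', 'b'),
    ('c1', 't3', 'c'), ('c2', 't3', 'a'), ('c3', 't3', 'c'),
    ('c1', 't4', 'a'), ('c2', 't4', 'c'), ('c3', 't4', 'c'),
]
task = agreement.AnnotationTask(data=data)
print("Alpha: {}".format(task.alpha()))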

Came across your question while looking for similar information. We used the code below, with nltk.agreement for the metrics and pandas_ods_reader to read the data from a LibreOffice spreadsheet. Our data has two annotators, and some items can carry two labels (for instance, one coder assigned a single label where the other assigned two).
The spreadsheet screencap below shows the structure of the input data. The column for annotation items is called annotItems, and annotation columns are called coder1 and coder2. The separator when there's more than one label is a pipe, unlike the comma in your example.
The code is inspired by this SO post: Low alpha for NLTK agreement using MASI distance
[Spreadsheet screencap]
from nltk import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance
import pandas_ods_reader as pdreader

annotfile = "test-iaa-so.ods"
df = pdreader.read_ods(annotfile, "Sheet1")

annots = []

def create_annot(an):
    """
    Create a frozenset with the unique label,
    or with both labels by splitting on the pipe.
    A single label has to go in a list so that
    frozenset does not split it into characters.
    """
    if "|" in str(an):
        an = frozenset(an.split("|"))
    else:
        # a single label has to go in a list;
        # whether you need the cast depends on your data
        an = frozenset([str(int(an))])
    return an

for idx, row in df.iterrows():
    annot_id = row.annotItem + str(idx).zfill(3)
    annot_coder1 = ['coder1', annot_id, create_annot(row.coder1)]
    annot_coder2 = ['coder2', annot_id, create_annot(row.coder2)]
    annots.append(annot_coder1)
    annots.append(annot_coder2)

# based on https://stackoverflow.com/questions/45741934/
jaccard_task = agreement.AnnotationTask(distance=jaccard_distance)
masi_task = agreement.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]
for task in tasks:
    task.load_array(annots)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
For the data in the screencap linked from this answer, this would print:
Statistics for dataset using <function jaccard_distance at 0x7fa1464b6050>
C: {'coder1', 'coder2'}
I: {'item3002', 'item1000', 'item6005', 'item5004', 'item2001', 'item4003'}
K: {frozenset({'1'}), frozenset({'0'}), frozenset({'0', '1'})}
Pi: 0.1818181818181818
Kappa: 0.35714285714285715
Multi-Kappa: 0.35714285714285715
Alpha: 0.02941176470588236
Statistics for dataset using <function masi_distance at 0x7fa1464b60e0>
C: {'coder1', 'coder2'}
I: {'item3002', 'item1000', 'item6005', 'item5004', 'item2001', 'item4003'}
K: {frozenset({'1'}), frozenset({'0'}), frozenset({'0', '1'})}
Pi: 0.09181818181818181
Kappa: 0.2864285714285714
Multi-Kappa: 0.2864285714285714
Alpha: 0.017962466487935425

Related

Speeding up fuzzy match on large list

I am working on a project that applies fuzzy matching to a list of names that could grow to about 100,000 unique records. In a recent screening run, the functions we use took about 2.20 seconds per name on average. At that rate, a list of 10,000 names would take roughly 6 hours, which is far too long.
Is there a way that we can speed up our process? Here's the snippet of the script that we use.
# Importing packages
import pandas as pd
import Levenshtein as lev

# Reading cleaned datasets
df_name_reference = pd.read_csv('path_to_file')
df_name_to_screen = pd.read_csv('path_to_file')

# Function used in name screening
def get_similarity_score(s1, s2):
    '''Return match percentage between 2 strings, disregarding name swapping

    Parameters
    -----------
    s1 : str : name from df_name_reference (to be used within pandas apply)
    s2 : str : name from df_name_to_screen (ref_name variable)

    Return
    -----------
    float
    '''
    # Get sorted names
    s1_sort = ' '.join(sorted(s1.split(' '))).strip() if type(s1)==str else ''
    s2_sort = ' '.join(sorted(s2.split(' '))).strip() if type(s2)==str else ''
    # Get ratios and return the max value
    # THIS COULD BE THE BOTTLENECK OF OUR SCRIPT: MORE DETAILS BELOW
    return max([
        lev.ratio(s1, s2),
        lev.ratio(s1_sort, s2),
        lev.ratio(s1, s2_sort),
        lev.ratio(s1_sort, s2_sort)
    ])

# Returning file
screening_results = []
for row in range(df_name_to_screen.shape[0]):
    # Get name to screen
    ref_name = df_name_to_screen.loc[row, 'fullname']
    # Get scores
    scores = df_name_reference.fullname.apply(get_similarity_score, args=(ref_name,))
    # Append results
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
I take four scores from lev.ratio to address variations in name order, i.e. firstname-lastname vs. lastname-firstname formats. I know the fuzzywuzzy package has token_sort_ratio, but it just splits the name parts and sorts them alphabetically, which led to lower scores for me. Plus, fuzzywuzzy is slower than Levenshtein, so I had to capture the similarity scores of sorted and unsorted names manually.
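To illustrate why the sorted variants matter, a quick sketch with made-up names:
import Levenshtein as lev

s1, s2 = "john smith", "smith john"   # same name, swapped order
print(lev.ratio(s1, s2))              # well below 1.0: raw strings differ in word order
s1_sort = ' '.join(sorted(s1.split(' ')))
s2_sort = ' '.join(sorted(s2.split(' ')))
print(lev.ratio(s1_sort, s2_sort))    # 1.0 once the name parts are sorted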
Can anyone give an approach that I could try? Thanks!
EDIT: Here's a sample dataset that you may try. This is in Google Drive.
If you don't need scores for all entries in the reference data but just the top N, you can use difflib.get_close_matches to filter out the others before calculating any scores:
import difflib

N_RESULTS = 10  # how many close matches to keep per name (e.g. the 10 closest)

screening_results = []
for row in range(df_name_to_screen.shape[0]):
    ref_name = df_name_to_screen.loc[row, 'fullname']
    skimmed = pd.DataFrame({
        'fullname': difflib.get_close_matches(
            ref_name,
            df_name_reference.fullname,
            N_RESULTS,
            0  # cutoff 0: always return N_RESULTS candidates
        )
    })
    scores = skimmed.fullname.apply(lev.ratio, args=(ref_name,))
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
This takes about 50ms per row using the file you provided.
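If you can take a dependency, the rapidfuzz library can also score the whole cross-product in one call, which is usually far faster than a Python-level loop. A minimal sketch (assuming both fullname columns hold plain strings; column names as in the question):
from rapidfuzz import process, fuzz

# score every name-to-screen against every reference name at once;
# note fuzz.ratio is scaled 0-100 here (python-Levenshtein's ratio is 0-1),
# and workers=-1 spreads the work over all CPU cores
scores = process.cdist(
    df_name_to_screen['fullname'].tolist(),
    df_name_reference['fullname'].tolist(),
    scorer=fuzz.ratio,
    workers=-1,
)
# scores[i, j] is the similarity of name-to-screen i vs. reference name j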

How to split a dataset with multiple variables into train and test while both having the same composition using python?

I have a list of brain metastasis MRIs that I want to use for training and testing purposes.
These images are all similar, but the original tumor sites differ. See the following example:
From Lungs:
"Image01.1"
"Image01.2"
"Image01.3"
"Image01.4"
From Breasts:
"Image02.1"
"Image02.2"
"Image02.3"
"Image02.4"
"Image02.5"
From Skin:
"Image03.1"
"Image03.2"
From Lung Tissue:
"Image04.1"
"Image04.2"
"Image04.3"
From Bone Marrow:
"Image05.1"
"Image05.2"
I want the testing and training sets to contain the same number of images while keeping a similar composition (both lists containing the same number of each subtype).
For this purpose, can I create a list for each subtype, randomly split each of those 50/50, and then add all the resulting lists together?
If you want to get specific rows from a pandas DataFrame that meet certain criteria, you can filter. In your case, something like:
reader_lung = reader[reader["Image_Title"] == "Lung"]
"Image_Title" you need to change to the name of the column you're looking for your keyword (e.g., Lung) in. This needs to be an exact match.
For something that doesn't require an exact match, you could also do the following:
reader_lung = reader[reader["Image_Title"].str.contains("Lung")]
Could you create a list of lists (one for each type) and then take the first N and put them into training and the last N and put them in test?
Something like this pseudocode:
with open(r"B:/.../excell.csv", newline='') as f:
reader = csv.reader(f, dialect="excel",delimiter=';')
test = []
training = []
type_map = {}
for row in reader:
if row[33] in type_map:
# If the type has already been viewed, append to the existing list of those images
type_map[row[33]].append(row)
else:
# If this type is seen for the first time, create a new array with that row in it
type_map[row[33]] = [row]
# Now you should have a map like : {"Lung": ["image1", "image2" ...], "Heart": ["imageA"....]}
for image_type in type_map:
type_images = type_map[image_type]
half_way_index = len(type_images)/2 # For odd elements i.e 13 elems this will give you 6 (integer division)
test += type_images[0:half_way_index] # First half of the type_images are test
training += type_images[half_way_index:(half_way_index*2)] # Second half are training
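Alternatively, if each image's subtype is available as a label, scikit-learn's train_test_split can do the stratified 50/50 split in one call. A minimal sketch with hypothetical inputs:
from sklearn.model_selection import train_test_split

# hypothetical inputs: one entry per image plus its tumor-site subtype
images = ["Image01.1", "Image01.2", "Image02.1", "Image02.2"]
subtypes = ["lung", "lung", "breast", "breast"]

# stratify keeps the subtype proportions identical in both halves
train, test = train_test_split(images, test_size=0.5, stratify=subtypes, random_state=0)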

create new dataframe using function

I ran into this nice blog post: https://towardsdatascience.com/the-search-for-categorical-correlation-a1c.
The author creates a function that allows you to calculate associations between categorical features and then create a heatmap out of it.
The function is given as:
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
I am able to create a list of associations between one feature and the rest by running the function in a for loop.
for item in raw[categorical].columns.tolist():
    value = cramers_v(raw['status_group'], raw[item])
    print(item, value)
It works in the sense that I get a list of association values, but I don't know how to run this function for all features against each other and turn the results into a new dataframe.
The author of the article has written a nice new library, dython, that has this feature built in, but it doesn't work out for my long list of features (my laptop can't handle it). Running its associations function on the first 100 lines of my df results in this:
[Heatmap screencap]
How could I run the cramers_v function for all combinations of features and then turn this into a df which I could display in a heatmap?
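One way (a sketch, untested on your data, reusing raw and categorical from your loop above) is to fill a square DataFrame over all column pairs and hand it to seaborn.heatmap:
import itertools
import pandas as pd
import seaborn as sns

cols = raw[categorical].columns
corr = pd.DataFrame(index=cols, columns=cols, dtype=float)
# compute cramers_v for every ordered pair of categorical columns
for c1, c2 in itertools.product(cols, repeat=2):
    corr.loc[c1, c2] = cramers_v(raw[c1], raw[c2])
sns.heatmap(corr, annot=True)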

Convert Python dictionary to Word2Vec object

I have obtained a dictionary mapping words to their vectors in Python, and I am trying to scatter plot the n most similar words, since t-SNE on a huge number of words takes forever. The best option seems to be converting the dictionary to a w2v object to deal with it.
I had the same issue and I finally found the solution.
So, I assume that your dictionary looks like mine:
import numpy as np

d = {}
d['1'] = np.random.randn(300)
d['2'] = np.random.randn(300)
Basically, the keys are the users' ids and each of them has a vector with shape (300,).
So now, in order to use it as word2vec, I first need to save it to a binary file and then load it with the gensim library:
import gensim
from numpy import float32 as REAL
from gensim import utils

# note: this relies on the gensim 3.x KeyedVectors API (vocab was removed in gensim 4)
m = gensim.models.keyedvectors.Word2VecKeyedVectors(vector_size=300)
m.vocab = d
m.vectors = np.array(list(d.values()))
my_save_word2vec_format(binary=True, fname='train.bin', total_vec=len(d), vocab=m.vocab, vectors=m.vectors)
Where my_save_word2vec_format function is:
def my_save_word2vec_format(fname, vocab, vectors, binary=True, total_vec=2):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vectors : numpy.array
        The vectors to be stored.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
    total_vec : int, optional
        Explicitly specify total number of vectors
        (in case word vectors are appended with document vectors afterwards).
    """
    if not (vocab or vectors):
        raise RuntimeError("no input")
    if total_vec is None:
        total_vec = len(vocab)
    vector_size = vectors.shape[1]
    assert (len(vocab), vector_size) == vectors.shape
    with utils.smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)
        fout.write(utils.to_utf8("%s %s\n" % (total_vec, vector_size)))
        # write one entry per key: the key, then its raw vector bytes (or text)
        for word, row in vocab.items():
            if binary:
                row = row.astype(REAL)
                fout.write(utils.to_utf8(word) + b" " + row.tobytes())  # tobytes() replaces the deprecated tostring()
            else:
                fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
And then, to load the model back as word2vec:
m2 = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format('train.bin', binary=True)
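Once loaded, the n most similar entries for the scatter plot can be queried directly; a usage sketch with the made-up keys from above:
# fetch the 10 nearest neighbours of key '1' as (key, cosine similarity) pairs
print(m2.most_similar('1', topn=10))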
If you've calculated the word-vectors with your own code, you may want to write them to a file in a format compatible with Google's original word2vec.c or gensim. You can review the gensim code in KeyedVectors.save_word2vec_format() to see exactly how its vectors are written – it's less than 20 lines of code – and do something similar for your vectors. See:
https://github.com/RaRe-Technologies/gensim/blob/3d2227d58b10d0493006a3d7e63b98d64e991e60/gensim/models/keyedvectors.py#L130
Then you could re-load vectors that originated with your code and use them almost directly with examples like the one from Jeff Delaney you mention.

How is the feature score(/importance) in the XGBoost package calculated?

The command xgb.importance returns a graph of feature importance measured by an f score.
What does this f score represent and how is it calculated?
Output:
[Graph of feature importance]
This is a metric that simply sums up how many times each feature is split on. It is analogous to the Frequency metric in the R version (see https://cran.r-project.org/web/packages/xgboost/xgboost.pdf).
It is about as basic a feature importance metric as you can get: how many times was this variable split on?
The code for this method shows it simply adds up the occurrences of a given feature across all the trees; see get_fscore in xgboost/core.py: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L953
def get_fscore(self, fmap=''):
    """Get feature importance of each feature.

    Parameters
    ----------
    fmap: str (optional)
        The name of feature map file
    """
    trees = self.get_dump(fmap)  # dump all the trees to text
    fmap = {}
    for tree in trees:  # loop through the trees
        for line in tree.split('\n'):
            arr = line.split('[')
            if len(arr) == 1:  # this line has no split condition
                continue
            fid = arr[1].split(']')[0]
            fid = fid.split('<')[0]  # split on the less-than sign to find the variable name
            if fid not in fmap:  # if the feature id hasn't been seen yet
                fmap[fid] = 1  # add it
            else:
                fmap[fid] += 1  # else increment it
    return fmap  # fmap holds the count of how many times each variable was split on
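In practice you rarely need to reimplement this: the counts are available directly from a trained booster, and xgboost can plot them. A minimal sketch, assuming bst is a trained xgboost.Booster:
import xgboost as xgb

# split counts per feature, e.g. {'f0': 12, 'f3': 7, ...}
scores = bst.get_fscore()
# bar chart of the same counts (requires matplotlib)
xgb.plot_importance(bst)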
I found this answer correct and thorough. It shows the implementation of the feature_importances.
https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
