I am trying to create nested dictionaries as I loop through tokens output by my NER model. This is the code that I have so far:
token_classifier = pipeline('ner', model='./fine_tune_nerbert_output/', tokenizer = './fine_tune_nerbert_output/', aggregation_strategy="average")
sentence = "alisa brown i live in san diego, california and sometimes in kansas city, missouri"
tokens = token_classifier(sentence)
which outputs:
[{'entity_group': 'LABEL_1',
'score': 0.99938214,
'word': 'alisa',
'start': 0,
'end': 5},
{'entity_group': 'LABEL_2',
'score': 0.9972813,
'word': 'brown',
'start': 6,
'end': 11},
{'entity_group': 'LABEL_0',
'score': 0.99798816,
'word': 'i live in',
'start': 12,
'end': 21},
{'entity_group': 'LABEL_3',
'score': 0.9993938,
'word': 'san',
'start': 22,
'end': 25},
{'entity_group': 'LABEL_4',
'score': 0.9988097,
'word': 'diego',
'start': 26,
'end': 31},
{'entity_group': 'LABEL_0',
'score': 0.9996742,
'word': ',',
'start': 31,
'end': 32},
{'entity_group': 'LABEL_3',
'score': 0.9985813,
'word': 'california',
'start': 33,
'end': 43},
{'entity_group': 'LABEL_0',
'score': 0.9997311,
'word': 'and sometimes in',
'start': 44,
'end': 60},
{'entity_group': 'LABEL_3',
'score': 0.9995384,
'word': 'kansas',
'start': 61,
'end': 67},
{'entity_group': 'LABEL_4',
'score': 0.9988242,
'word': 'city',
'start': 68,
'end': 72},
{'entity_group': 'LABEL_0',
'score': 0.99949193,
'word': ',',
'start': 72,
'end': 73},
{'entity_group': 'LABEL_3',
'score': 0.99960154,
'word': 'missouri',
'start': 74,
'end': 82}]
I then run a for loop:
ner_dict = dict()
nested_dict = dict()
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        if token['entity_group'] in ner_dict:
            nested_dict[token['entity_group']] = {}
            nested_dict[token['entity_group']][token['word']] = token['score']
            ner_dict.update({token['entity_group']: (ner_dict[token['entity_group']], nested_dict[token['entity_group']])})
        else:
            ner_dict[token['entity_group']] = {}
            ner_dict[token['entity_group']][token['word']] = token['score']
this outputs:
{'LABEL_1': {'devyn': 0.9995816},
'LABEL_2': {'donahue': 0.9996502},
'LABEL_3': ((({'san': 0.9994766}, {'california': 0.998961}),
{'san': 0.99925905}),
{'california': 0.9987863}),
'LABEL_4': ({'francisco': 0.99923646}, {'diego': 0.9992399})}
which is close to what I want, but this is my ideal output:
{'LABEL_1': {'devyn': 0.9995816},
'LABEL_2': {'donahue': 0.9996502},
'LABEL_3': ({'san': 0.9994766}, {'california': 0.998961}, {'san': 0.99925905},
{'california': 0.9987863}),
'LABEL_4': ({'francisco': 0.99923646}, {'diego': 0.9992399})}
How would I do this without getting each entry wrapped in a different tuple? Thanks in advance.
Based on the input provided, your output for LABEL_4 should be diego and city. Something like below:
{
'LABEL_1': {'alisa': 0.99938214},
'LABEL_2': {'brown': 0.9972813},
'LABEL_3': {'san': 0.9993938, 'california': 0.9985813, 'kansas': 0.9995384},
'LABEL_4': {'diego': 0.9988097, 'city': 0.9988242}
}
If the above output is what you desire, change the code to
ner_dict = dict()
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        nested_dict = ner_dict.setdefault(token['entity_group'], {})
        nested_dict[token['word']] = token['score']
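For context, dict.setdefault returns the existing value when the key is already present, and otherwise inserts the default and returns it; that is what lets the loop keep extending one inner dict per label. A minimal sketch of the behavior:
d = {}
inner = d.setdefault('LABEL_3', {})  # key missing: inserts {} and returns it
inner['san'] = 0.999
inner = d.setdefault('LABEL_3', {})  # key present: returns the existing inner dict
inner['kansas'] = 0.9995
print(d)  # {'LABEL_3': {'san': 0.999, 'kansas': 0.9995}}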
Here is an example that you can use with your code:
ner_dict = {}
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        ner_dict.setdefault(token['entity_group'], {})[token['word']] = token['score']
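Equivalently, a collections.defaultdict avoids passing the default on every lookup; this is just an alternative sketch, not something the code above requires:
from collections import defaultdict

ner_dict = defaultdict(dict)
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        ner_dict[token['entity_group']][token['word']] = token['score']

ner_dict = dict(ner_dict)  # optional: convert back to a plain dict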
I have a trained BERT model that I want to use to annotate some text.
I am using the transformers pipeline for the NER task in the following way:
mode = AutoModelForTokenClassification.from_pretrained(<my_model_path>)
tokenize = BertTokenizer.from_pretrained(<my_model_path>)
nlp_ner = pipeline(
"ner",
model=mode,
tokenizer=tokenize
)
Then, I am obtaining the prediction results by calling:
text = "3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing."
result = nlp_ner(text)
Where the returned result is:
[{'entity': 'LABEL_1', 'score': 0.99999774, 'index': 1, 'word': '3', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999979, 'index': 2, 'word': ')', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9999897, 'index': 3, 'word': 'rewrite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 4, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999962, 'index': 5, 'word': 'last', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999975, 'index': 6, 'word': 'sentence', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998623, 'index': 7, 'word': '“', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99997735, 'index': 8, 'word': 'scanning', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9941041, 'index': 9, 'word': 'probe', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.999994, 'index': 10, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999696, 'index': 11, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 12, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998647, 'index': 13, 'word': 'per', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999939, 'index': 14, 'word': '##ovsk', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.99999154, 'index': 15, 'word': '##ite', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999942, 'index': 16, 'word': 'family', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.9997022, 'index': 17, 'word': '”', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999929, 'index': 18, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999977, 'index': 19, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999076, 'index': 20, 'word': 'current', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99996257, 'index': 21, 'word': 'one', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9169066, 'index': 22, 'word': 'is', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.6795164, 'index': 23, 'word': 'quite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.7315716, 'index': 24, 'word': 'conf', 'start': None, 'end': None}, {'entity': 'LABEL_9', 'score': 0.9067044, 'index': 25, 'word': '##using', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999925, 'index': 26, 'word': '.', 'start': None, 'end': None}]
The problem I am facing now is that I would like to map the predicted classes back onto the text itself, but this looks complicated: the prediction results do not index words by simply splitting on spaces, and composed words, for example, are split into multiple subword tokens.
Is there a way to annotate the text (for example in Doccano JSON format) that is not too complex?
My goal is to be able to say: for all the "LABEL_9" predictions, highlight the original text with a specific HTML class. Or, even easier, find the start and end index of every word predicted as class "LABEL_9".
The tokenizer you are using inside the pipeline does not return offset information, which is why you get None for start in the model response. Switch to a fast tokenizer and you should get the start locations.
Solution
from transformers import pipeline, AutoModelForTokenClassification, PreTrainedTokenizerFast

model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased")
tokenizer = PreTrainedTokenizerFast.from_pretrained("bert-base-uncased")

nlp_ner = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer
)

text = "3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing."
results = nlp_ner(text)

for result in results:
    start = result["start"]
    end = result["start"] + len(result["word"].replace("##", ""))
    tag = result["entity"]
    print(f"Entity: {tag}, Start:{start}, End:{end}, Token:{text[start:end]}")
Output:
Entity: LABEL_1, Start:0, End:1, Token:3
Entity: LABEL_0, Start:1, End:2, Token:)
Entity: LABEL_1, Start:3, End:5, Token:Re
Entity: LABEL_1, Start:5, End:10, Token:write
Entity: LABEL_1, Start:11, End:14, Token:the
Entity: LABEL_0, Start:15, End:19, Token:last
Entity: LABEL_1, Start:20, End:28, Token:sentence
Entity: LABEL_1, Start:29, End:30, Token:“
Entity: LABEL_1, Start:30, End:38, Token:scanning
Entity: LABEL_1, Start:39, End:44, Token:probe
Entity: LABEL_1, Start:45, End:46, Token:.
Entity: LABEL_1, Start:46, End:47, Token:.
Entity: LABEL_1, Start:47, End:48, Token:.
Entity: LABEL_1, Start:49, End:52, Token:per
Entity: LABEL_1, Start:52, End:54, Token:ov
Entity: LABEL_1, Start:54, End:57, Token:ski
Entity: LABEL_0, Start:57, End:59, Token:te
Entity: LABEL_0, Start:60, End:66, Token:family
Entity: LABEL_0, Start:66, End:67, Token:”
Entity: LABEL_0, Start:67, End:68, Token:.
Entity: LABEL_1, Start:69, End:72, Token:The
Entity: LABEL_0, Start:73, End:80, Token:current
Entity: LABEL_0, Start:81, End:84, Token:one
Entity: LABEL_0, Start:85, End:87, Token:is
Entity: LABEL_1, Start:88, End:93, Token:quite
Entity: LABEL_1, Start:94, End:103, Token:confusing
Entity: LABEL_0, Start:103, End:104, Token:.
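As a side note, when loading your own fine-tuned model instead of bert-base-uncased, AutoTokenizer with use_fast=True should also give you a fast tokenizer (a sketch, assuming your saved directory contains the tokenizer files; the path below is taken from the first question and is only illustrative). Fast tokenizers populate result["end"] directly, so you don't need to recompute it from the word length:
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Illustrative local path; replace with your own fine-tuned model directory.
model = AutoModelForTokenClassification.from_pretrained("./fine_tune_nerbert_output/")
tokenizer = AutoTokenizer.from_pretrained("./fine_tune_nerbert_output/", use_fast=True)
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)

text = "3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing."
for result in nlp_ner(text):
    # start/end come straight from the fast tokenizer's offset mapping
    print(result["entity"], result["start"], result["end"], text[result["start"]:result["end"]])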
WordPiece tokenizer
The BERT tokenizer is a WordPiece tokenizer, i.e. a single word might get tokenized into multiple tokens (for example, perovskite above). It is standard practice to use the entity of the first token when a word gets split into multiple tokens, but you can also use other strategies such as max voting (a sketch follows the output below). Below is the logic to get the entities per word.
formatted_results = []
for result in results:
    end = result["start"] + len(result["word"].replace("##", ""))
    if result["word"].startswith("##"):
        formatted_results[-1]["end"] = end
        formatted_results[-1]["word"] += result["word"].replace("##", "")
    else:
        formatted_results.append({
            'start': result["start"],
            'end': end,
            'entity': result["entity"],
            'index': result["index"],
            'score': result["score"],
            'word': result["word"]})

for result in formatted_results:
    print(f"""Entity: {result["entity"]}, Start:{result["start"]}, End:{result["end"]}, word:{text[result["start"]:result["end"]]}""")
Output:
Entity: LABEL_1, Start:0, End:1, word:3
Entity: LABEL_1, Start:1, End:2, word:)
Entity: LABEL_1, Start:3, End:10, word:Rewrite
Entity: LABEL_1, Start:11, End:14, word:the
Entity: LABEL_1, Start:15, End:19, word:last
Entity: LABEL_0, Start:20, End:28, word:sentence
Entity: LABEL_1, Start:29, End:30, word:“
Entity: LABEL_1, Start:30, End:38, word:scanning
Entity: LABEL_1, Start:39, End:44, word:probe
Entity: LABEL_1, Start:45, End:46, word:.
Entity: LABEL_1, Start:46, End:47, word:.
Entity: LABEL_0, Start:47, End:48, word:.
Entity: LABEL_1, Start:49, End:59, word:perovskite
Entity: LABEL_0, Start:60, End:66, word:family
Entity: LABEL_1, Start:66, End:67, word:”
Entity: LABEL_0, Start:67, End:68, word:.
Entity: LABEL_0, Start:69, End:72, word:The
Entity: LABEL_1, Start:73, End:80, word:current
Entity: LABEL_1, Start:81, End:84, word:one
Entity: LABEL_1, Start:85, End:87, word:is
Entity: LABEL_1, Start:88, End:93, word:quite
Entity: LABEL_1, Start:94, End:103, word:confusing
Entity: LABEL_0, Start:103, End:104, word:.
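For completeness, the max-voting idea mentioned above could look like the sketch below: instead of keeping the first subword's entity, keep the entity of the highest-scoring subword of each word. This builds on the same formatted_results loop and is only one possible aggregation strategy:
formatted_results = []
for result in results:
    end = result["start"] + len(result["word"].replace("##", ""))
    if result["word"].startswith("##"):
        prev = formatted_results[-1]
        prev["end"] = end
        prev["word"] += result["word"].replace("##", "")
        # Keep the label of the highest-scoring subword instead of the first one.
        if result["score"] > prev["score"]:
            prev["entity"] = result["entity"]
            prev["score"] = result["score"]
    else:
        formatted_results.append({
            'start': result["start"],
            'end': end,
            'entity': result["entity"],
            'index': result["index"],
            'score': result["score"],
            'word': result["word"]})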
The only subtle point to keep in mind here is that transformers tokenizers don't split sentences on whitespace; they break texts into subwords (in most cases!). A token starting with ## always marks a continuation of the previous word, and this property can be exploited to reconstruct the original words.
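To see this concretely, you can tokenize a word and inspect the pieces (a small illustration; the exact split depends on the tokenizer vocabulary):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("perovskite family"))
# e.g. ['per', '##ovsk', '##ite', 'family'] -- '##' marks a continuation of the previous token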
With this in mind, an unprincipled implementation for converting the output of the pipeline into Doccano format would look like this:
UPDATE: adding a new argument, original_text, to the function helps with reconstructing the text and getting rid of the extra spaces.
text = "3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing."
output = [{'entity': 'LABEL_1', 'score': 0.99999774, 'index': 1, 'word': '3', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999979, 'index': 2, 'word': ')', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9999897, 'index': 3, 'word': 'rewrite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 4, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999962, 'index': 5, 'word': 'last', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999975, 'index': 6, 'word': 'sentence', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998623, 'index': 7, 'word': '“', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99997735, 'index': 8, 'word': 'scanning', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9941041, 'index': 9, 'word': 'probe', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.999994, 'index': 10, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999696, 'index': 11, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 12, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998647, 'index': 13, 'word': 'per', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999939, 'index': 14, 'word': '##ovsk', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.99999154, 'index': 15, 'word': '##ite', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999942, 'index': 16, 'word': 'family', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.9997022, 'index': 17, 'word': '”', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999929, 'index': 18, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999977, 'index': 19, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999076, 'index': 20, 'word': 'current', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99996257, 'index': 21, 'word': 'one', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9169066, 'index': 22, 'word': 'is', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.6795164, 'index': 23, 'word': 'quite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.7315716, 'index': 24, 'word': 'conf', 'start': None, 'end': None}, {'entity': 'LABEL_9', 'score': 0.9067044, 'index': 25, 'word': '##using', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999925, 'index': 26, 'word': '.', 'start': None, 'end': None}]
def pipelineoutput_2_doccon(pipelineoutput, original_text, merge_labels=True):
    words = [item['word'] for item in pipelineoutput]
    labels = [item['entity'] for item in pipelineoutput]
    tags = []
    text = ''
    for idx, (w, l) in enumerate(zip(words, labels)):
        if w.startswith('##'):
            length = len(text)
            text += w[2:]
            tags.append([length, length + len(w[2:]), l, w])
        else:
            if text != '':
                text += ' '
            length = len(text)
            text += w
            tags.append([length, length + len(w), l, w])
    length += len(text)
    if not merge_labels:
        final_tags = []
        temp_text = original_text.lower()
        accumulated = 0
        for item in tags:
            start = temp_text.find(item[-1])
            end = start + len(item[-1])
            final_tags.append([accumulated + start, accumulated + end, item[-2]])
            temp_text = temp_text[end:]
            accumulated += end
        return {"text": original_text, "label": final_tags}
    else:
        merged_tags = []
        for idx in range(len(tags)):
            if tags[idx][1] == len(text) or text[tags[idx][1]] == ' ':
                merged_tags.append(tags[idx])
            else:
                tags[idx + 1][0] = tags[idx][0]
                tags[idx + 1][-1] = tags[idx][-1] + tags[idx + 1][-1].replace('##', '')
        final_tags = []
        temp_text = original_text.lower()
        accumulated = 0
        for item in merged_tags:
            start = temp_text.find(item[-1])
            end = start + len(item[-1])
            final_tags.append([accumulated + start, accumulated + end, item[-2]])
            temp_text = temp_text[end:]
            accumulated += end
        return {"text": original_text, "label": final_tags}
pipelineoutput_2_doccon(output, text, merge_labels=False)
output:
{'label': [[0, 1, 'LABEL_1'],
[1, 2, 'LABEL_1'],
[3, 10, 'LABEL_8'],
[11, 14, 'LABEL_1'],
[15, 19, 'LABEL_1'],
[20, 28, 'LABEL_1'],
[29, 30, 'LABEL_2'],
[30, 38, 'LABEL_2'],
[39, 44, 'LABEL_3'],
[45, 46, 'LABEL_1'],
[46, 47, 'LABEL_1'],
[47, 48, 'LABEL_1'],
[49, 59, 'LABEL_3'],
[60, 66, 'LABEL_3'],
[66, 67, 'LABEL_2'],
[67, 68, 'LABEL_1'],
[69, 72, 'LABEL_1'],
[73, 80, 'LABEL_1'],
[81, 84, 'LABEL_1'],
[85, 87, 'LABEL_8'],
[88, 93, 'LABEL_1'],
[94, 103, 'LABEL_9'],
[103, 104, 'LABEL_1']],
'text': '3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing.'}
As you may have noticed, label assignment happens at the subword level. If the sequence labeler has learned the task well, it should assign consistent labels to the subwords of each word; otherwise, it is up to you to aggregate the labels in whatever way best meets your goals.
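Once you have the Doccano-style spans, the original goal (highlighting every "LABEL_9" span with an HTML class) becomes a simple string operation. A minimal sketch, assuming the {"text": ..., "label": [[start, end, label], ...]} structure returned above; the CSS class name is made up:
def highlight(doccano, target_label="LABEL_9", css_class="hl"):
    annotated = doccano["text"]
    # Insert tags right-to-left so earlier offsets stay valid.
    spans = sorted((s for s in doccano["label"] if s[2] == target_label), reverse=True)
    for start, end, _ in spans:
        annotated = (annotated[:start]
                     + f'<span class="{css_class}">' + annotated[start:end] + '</span>'
                     + annotated[end:])
    return annotated

print(highlight(pipelineoutput_2_doccon(output, text, merge_labels=False)))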
Excuse me, I need your help!
Code Script
tracks_ = []
track = {}

if category == 'reference':
    for i in range(len(tracks)):
        if len(tracks) >= 1:
            _track = tracks[i]
            track['id'] = _track['id']
            tracks_.append(track)

print(tracks_)
tracks File
[{'id': 345, 'mode': 'ghost', 'missed': 27, 'box': [0.493, 0.779, 0.595, 0.808], 'score': 89, 'class': 1, 'time': 3352}, {'id': 347, 'mode': 'ghost', 'missed': 9, 'box': [0.508, 0.957, 0.631, 0.996], 'score': 89, 'class': 1, 'time': 5463}, {'id': 914, 'mode': 'track', 'missed': 0, 'box': [0.699, 0.496, 0.991, 0.581], 'score': 87, 'class': 62, 'time': 6549}, {'id': 153, 'mode': 'track', 'missed': 0, 'box': [0.613, 0.599, 0.88, 0.689], 'score': 73, 'class': 62, 'time': 6549}, {'id': 588, 'mode': 'track', 'missed': 0, 'box': [0.651, 0.685, 0.958, 0.775], 'score': 79, 'class': 62, 'time': 6549}, {'id': 972, 'mode': 'track', 'missed': 0, 'box': [0.632, 0.04, 0.919, 0.126], 'score': 89, 'class': 62, 'time': 6549}, {'id': 300, 'mode': 'ghost', 'missed': 6, 'box': [0.591, 0.457, 0.74, 0.498], 'score': 71, 'class': 62, 'time': 5716}]
Based on the code script and the input above, I want to print out tracks_, and the result is:
[{'id': 300}, {'id': 300}, {'id': 300}, {'id': 300}, {'id': 300}, {'id': 300}, {'id': 300}]
but the result that prints out should be like this:
[{'id': 345}, {'id': 347}, {'id': 914}, {'id': 153}, {'id': 588}, {'id': 972}, {'id': 300}]
You are appending the same dict to your list tracks_, so the list contains only references to a single dict: any modification to the dict track is reflected in every element of the list. To fix it, create a new dict on each iteration:
if category == 'reference' and len(tracks) >= 1:
    for d in tracks:
        tracks_.append({'id': d['id']})
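If the aliasing is not obvious, this tiny demonstration shows why every element ended up as {'id': 300}: the list holds several references to one dict, so the last assignment shows up everywhere:
track = {}
tracks_ = []
for i in [345, 347, 300]:
    track['id'] = i        # mutates the single shared dict
    tracks_.append(track)  # appends another reference to that same dict
print(tracks_)  # [{'id': 300}, {'id': 300}, {'id': 300}]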
You could use a list comprehension:
tracks_ = [{'id': t['id']} for t in tracks]
tracks_
output:
[{'id': 345},
{'id': 347},
{'id': 914},
{'id': 153},
{'id': 588},
{'id': 972},
{'id': 300}]
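If you still need the category guard from the original code, it can wrap the comprehension (same variables as in the question):
tracks_ = [{'id': t['id']} for t in tracks] if category == 'reference' else []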