Altering string using a list of dictionaries - python

Background
I am using NeuroNER http://neuroner.com/ to label text data sample_string as seen below.
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2000 and her number is 1111112222'
Output (using NeuroNER)
My output is a list of dictionaries, dic_list:
dic_list = [
{'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
{'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
{'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
{'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2000'},
{'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '1111112222'}]
Legend
id = text ID
type = type of text being identified
start = starting position of identified text
end = ending position of identified text
text = text that is identified
Goal
Since the location of each text (e.g. Jane) is given by start and end, I would like to replace each text from dic_list with **BLOCK** in my string sample_string
Desired Output
sample_string = 'Patient **BLOCK** **BLOCK** was seen by Dr. **BLOCK** on **BLOCK** and her number is **BLOCK**'
Question
I have tried Replacing a character from a certain index and Edit the values in a list of dictionaries?, but they are not quite what I am looking for.
How do I achieve my desired output?

If you want a solution based on the start and end indexes, you can use the intervals between the entries in dic_list to find the parts you need to keep, then join them with **BLOCK**.
Try this:
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'
dic_list = [
{'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
{'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
{'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
{'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
{'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]
parts_to_take = ([(0, dic_list[0]['start'])]
                 + [(first['end'] + 1, second['start']) for first, second in zip(dic_list, dic_list[1:])]
                 + [(dic_list[-1]['end'] + 1, len(sample_string))])
parts = [sample_string[start:end] for start, end in parts_to_take]
sample_string = '**BLOCK**'.join(parts)
print(sample_string)
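An equivalent sketch (assuming the same inclusive start/end indices as in the question) processes the entities from right to left, so the offsets of not-yet-processed entities never shift and no interval arithmetic is needed:

```python
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2000 and her number is 1111112222'
dic_list = [
    {'id': 'T1', 'start': 8, 'end': 11, 'text': 'Jane'},
    {'id': 'T2', 'start': 13, 'end': 17, 'text': 'Candy'},
    {'id': 'T3', 'start': 35, 'end': 39, 'text': 'Smith'},
    {'id': 'T4', 'start': 44, 'end': 52, 'text': '12/1/2000'},
    {'id': 'T5', 'start': 72, 'end': 81, 'text': '1111112222'},
]

# Replace from the last entity to the first so earlier indices stay valid.
for dic in sorted(dic_list, key=lambda d: d['start'], reverse=True):
    sample_string = (sample_string[:dic['start']]
                     + '**BLOCK**'
                     + sample_string[dic['end'] + 1:])

print(sample_string)
```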

I may be missing something but you can just use .replace():
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'
dic_list = [
{'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
{'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
{'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
{'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
{'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]
for dic in dic_list:
    sample_string = sample_string.replace(dic['text'], '**BLOCK**')
print(sample_string)
Though regex will probably be faster:
import re
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'
dic_list = [
{'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
{'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
{'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
{'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
{'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]
pattern = re.compile('|'.join(re.escape(dic['text']) for dic in dic_list))
result = pattern.sub('**BLOCK**', sample_string)
print(result)
Both output:
Patient **BLOCK** **BLOCK** was seen by Dr. **BLOCK** on **BLOCK** and her number is **BLOCK**
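One caveat with both of these approaches: they substitute every occurrence of the matched text anywhere in the string, not just the span reported by start/end, and entity text containing regex metacharacters (e.g. '.') must be escaped for the regex version. A small sketch of the escaped pattern, using made-up entity strings:

```python
import re

# Hypothetical entity texts; '/' and digits are harmless, but re.escape
# also protects characters like '.' that are special in a regex.
dic_list = [{'text': '12/1/2018'}, {'text': '5041112222'}]
pattern = re.compile('|'.join(re.escape(d['text']) for d in dic_list))
result = pattern.sub('**BLOCK**', 'seen on 12/1/2018, call 5041112222')
print(result)
```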

Per the suggestion of Error - Syntactical Remorse:
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'
dic_list = [
{'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
{'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
{'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
{'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
{'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]
offset = 0
filler = '**BLOCK**'
for dic in dic_list:
    sample_string = sample_string[:dic['start'] + offset] + filler + sample_string[dic['end'] + offset + 1:]
    offset += dic['start'] - dic['end'] + len(filler) - 1
print(sample_string)

Related

Append a list of dictionaries to the value in another dictionary

I am trying to create nested dictionaries as I loop through tokens output by my NER model. This is the code that I have so far:
token_classifier = pipeline('ner', model='./fine_tune_nerbert_output/', tokenizer = './fine_tune_nerbert_output/', aggregation_strategy="average")
sentence = "alisa brown i live in san diego, california and sometimes in kansas city, missouri"
tokens = token_classifier(sentence)
which outputs:
[{'entity_group': 'LABEL_1',
'score': 0.99938214,
'word': 'alisa',
'start': 0,
'end': 5},
{'entity_group': 'LABEL_2',
'score': 0.9972813,
'word': 'brown',
'start': 6,
'end': 11},
{'entity_group': 'LABEL_0',
'score': 0.99798816,
'word': 'i live in',
'start': 12,
'end': 21},
{'entity_group': 'LABEL_3',
'score': 0.9993938,
'word': 'san',
'start': 22,
'end': 25},
{'entity_group': 'LABEL_4',
'score': 0.9988097,
'word': 'diego',
'start': 26,
'end': 31},
{'entity_group': 'LABEL_0',
'score': 0.9996742,
'word': ',',
'start': 31,
'end': 32},
{'entity_group': 'LABEL_3',
'score': 0.9985813,
'word': 'california',
'start': 33,
'end': 43},
{'entity_group': 'LABEL_0',
'score': 0.9997311,
'word': 'and sometimes in',
'start': 44,
'end': 60},
{'entity_group': 'LABEL_3',
'score': 0.9995384,
'word': 'kansas',
'start': 61,
'end': 67},
{'entity_group': 'LABEL_4',
'score': 0.9988242,
'word': 'city',
'start': 68,
'end': 72},
{'entity_group': 'LABEL_0',
'score': 0.99949193,
'word': ',',
'start': 72,
'end': 73},
{'entity_group': 'LABEL_3',
'score': 0.99960154,
'word': 'missouri',
'start': 74,
'end': 82}]
I then run a for loop:
ner_dict = dict()
nested_dict = dict()
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        if token['entity_group'] in ner_dict:
            nested_dict[token['entity_group']] = {}
            nested_dict[token['entity_group']][token['word']] = token['score']
            ner_dict.update({token['entity_group']: (ner_dict[token['entity_group']], nested_dict[token['entity_group']])})
        else:
            ner_dict[token['entity_group']] = {}
            ner_dict[token['entity_group']][token['word']] = token['score']
this outputs:
{'LABEL_1': {'devyn': 0.9995816},
'LABEL_2': {'donahue': 0.9996502},
'LABEL_3': ((({'san': 0.9994766}, {'california': 0.998961}),
{'san': 0.99925905}),
{'california': 0.9987863}),
'LABEL_4': ({'francisco': 0.99923646}, {'diego': 0.9992399})}
which is close to what I want but this is my ideal output:
{'LABEL_1': {'devyn': 0.9995816},
'LABEL_2': {'donahue': 0.9996502},
'LABEL_3': ({'san': 0.9994766}, {'california': 0.998961}, {'san': 0.99925905},
{'california': 0.9987863}),
'LABEL_4': ({'francisco': 0.99923646}, {'diego': 0.9992399})}
how would I do this without getting each entry in a different tuple? Thanks in advance.
Your output for LABEL_4 should be diego and city based on the input provided, something like below:
{
'LABEL_1': {'alisa': 0.99938214},
'LABEL_2': {'brown': 0.9972813},
'LABEL_3': {'san': 0.9993938, 'california': 0.9985813, 'kansas': 0.9995384},
'LABEL_4': {'diego': 0.9988097, 'city': 0.9988242}
}
If the above output is what you desire, change the code to
ner_dict = dict()
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        nested_dict = ner_dict.setdefault(token['entity_group'], {})
        nested_dict[token['word']] = token['score']
Here is an example that you can use with your code:
ner_dict = {}
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        ner_dict.setdefault(token['entity_group'], {})[token['word']] = token['score']
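For comparison, collections.defaultdict expresses the same grouping; the inner dict is created automatically on first access per label (a sketch over a trimmed-down tokens list):

```python
from collections import defaultdict

# Shortened version of the tokens list from the question.
tokens = [
    {'entity_group': 'LABEL_1', 'word': 'alisa', 'score': 0.99938214},
    {'entity_group': 'LABEL_0', 'word': 'i live in', 'score': 0.99798816},
    {'entity_group': 'LABEL_3', 'word': 'san', 'score': 0.9993938},
    {'entity_group': 'LABEL_3', 'word': 'kansas', 'score': 0.9995384},
]

# defaultdict(dict) creates an empty inner dict the first time a
# label is seen, so no explicit membership check is needed.
ner_dict = defaultdict(dict)
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        ner_dict[token['entity_group']][token['word']] = token['score']
```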

Python: How to annotate back to text the output of a transformers pipeline

I have a trained BERT model that I would like to use to annotate some text.
I am using the transformers pipeline for the NER task in the following way:
mode = AutoModelForTokenClassification.from_pretrained(<my_model_path>)
tokenize = BertTokenizer.from_pretrained(<my_model_path>)
nlp_ner = pipeline(
    "ner",
    model=mode,
    tokenizer=tokenize
)
Then, I am obtaining the prediction results by calling:
text = "3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing."
result = nlp_ner(text)
Where the returned result is:
[{'entity': 'LABEL_1', 'score': 0.99999774, 'index': 1, 'word': '3', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999979, 'index': 2, 'word': ')', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9999897, 'index': 3, 'word': 'rewrite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 4, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999962, 'index': 5, 'word': 'last', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999975, 'index': 6, 'word': 'sentence', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998623, 'index': 7, 'word': '“', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99997735, 'index': 8, 'word': 'scanning', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9941041, 'index': 9, 'word': 'probe', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.999994, 'index': 10, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999696, 'index': 11, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 12, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998647, 'index': 13, 'word': 'per', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999939, 'index': 14, 'word': '##ovsk', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.99999154, 'index': 15, 'word': '##ite', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999942, 'index': 16, 'word': 'family', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.9997022, 'index': 17, 'word': '”', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999929, 'index': 18, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999977, 'index': 19, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999076, 'index': 20, 'word': 'current', 'start': None, 'end': None}, 
{'entity': 'LABEL_1', 'score': 0.99996257, 'index': 21, 'word': 'one', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9169066, 'index': 22, 'word': 'is', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.6795164, 'index': 23, 'word': 'quite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.7315716, 'index': 24, 'word': 'conf', 'start': None, 'end': None}, {'entity': 'LABEL_9', 'score': 0.9067044, 'index': 25, 'word': '##using', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999925, 'index': 26, 'word': '.', 'start': None, 'end': None}]
The problem I am facing now is that I would like to annotate the predicted classes back onto the text itself, but this looks complicated: the predictions do not index words by simply splitting on spaces; composed words, for example, come back as multiple subword tokens.
Is there a way to annotate the text back (for example in Doccano JSON format) that is not too complex?
My goal is to be able to say: For all the "LABEL_9", highlight the initial text with a specific html class. Or even easier, find the start and the end index for all words predicted as being of class "LABEL_9".
The tokenizer you are using inside the pipeline does not give out offset information, which is why you get None for start in the model response. Change it to a fast tokenizer and you should get the start location.
Solution
from transformers import pipeline, AutoModelForTokenClassification, PreTrainedTokenizerFast
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased")
tokenizer = PreTrainedTokenizerFast.from_pretrained("bert-base-uncased")
nlp_ner = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer
)
text = "3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing."
results = nlp_ner(text)
for result in results:
    start = result["start"]
    end = result["start"] + len(result["word"].replace("##", ""))
    tag = result["entity"]
    print(f"Entity: {tag}, Start:{start}, End:{end}, Token:{text[start:end]}")
Output:
Entity: LABEL_1, Start:0, End:1, Token:3
Entity: LABEL_0, Start:1, End:2, Token:)
Entity: LABEL_1, Start:3, End:5, Token:Re
Entity: LABEL_1, Start:5, End:10, Token:write
Entity: LABEL_1, Start:11, End:14, Token:the
Entity: LABEL_0, Start:15, End:19, Token:last
Entity: LABEL_1, Start:20, End:28, Token:sentence
Entity: LABEL_1, Start:29, End:30, Token:“
Entity: LABEL_1, Start:30, End:38, Token:scanning
Entity: LABEL_1, Start:39, End:44, Token:probe
Entity: LABEL_1, Start:45, End:46, Token:.
Entity: LABEL_1, Start:46, End:47, Token:.
Entity: LABEL_1, Start:47, End:48, Token:.
Entity: LABEL_1, Start:49, End:52, Token:per
Entity: LABEL_1, Start:52, End:54, Token:ov
Entity: LABEL_1, Start:54, End:57, Token:ski
Entity: LABEL_0, Start:57, End:59, Token:te
Entity: LABEL_0, Start:60, End:66, Token:family
Entity: LABEL_0, Start:66, End:67, Token:”
Entity: LABEL_0, Start:67, End:68, Token:.
Entity: LABEL_1, Start:69, End:72, Token:The
Entity: LABEL_0, Start:73, End:80, Token:current
Entity: LABEL_0, Start:81, End:84, Token:one
Entity: LABEL_0, Start:85, End:87, Token:is
Entity: LABEL_1, Start:88, End:93, Token:quite
Entity: LABEL_1, Start:94, End:103, Token:confusing
Entity: LABEL_0, Start:103, End:104, Token:.
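Note that a fast tokenizer generally populates end as well as start via its offset mapping, so you can slice the original text directly with the reported offsets instead of recomputing the end from the token length. A sketch over a hypothetical results list (the offsets below are illustrative, not real pipeline output):

```python
# Hypothetical pipeline output, mimicking what a fast tokenizer
# produces when both 'start' and 'end' offsets are available.
results = [
    {'entity': 'LABEL_1', 'word': '3', 'start': 0, 'end': 1},
    {'entity': 'LABEL_1', 'word': 'Re', 'start': 3, 'end': 5},
    {'entity': 'LABEL_1', 'word': '##write', 'start': 5, 'end': 10},
]
text = "3) Rewrite the last sentence"

# Slice the original text directly with the reported offsets;
# no '##' stripping or length arithmetic required.
spans = [(r['start'], r['end'], text[r['start']:r['end']]) for r in results]
print(spans)
```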
wordpiece tokenizer
The BERT tokenizer is a wordpiece tokenizer, i.e., a single word may get tokenized into multiple tokens (for example perovskite above). It is standard practice to use the entity of the first token when a word gets split into multiple tokens; you can also use other strategies such as max voting. Below is the logic to get the entities per word.
formatted_results = []
for result in results:
    end = result["start"] + len(result["word"].replace("##", ""))
    if result["word"].startswith("##"):
        formatted_results[-1]["end"] = end
        formatted_results[-1]["word"] += result["word"].replace("##", "")
    else:
        formatted_results.append({
            'start': result["start"],
            'end': end,
            'entity': result["entity"],
            'index': result["index"],
            'score': result["score"],
            'word': result["word"]})
for result in formatted_results:
    print(f"""Entity: {result["entity"]}, Start:{result["start"]}, End:{result["end"]}, word:{text[result["start"]:result["end"]]}""")
Output:
Entity: LABEL_1, Start:0, End:1, word:3
Entity: LABEL_1, Start:1, End:2, word:)
Entity: LABEL_1, Start:3, End:10, word:Rewrite
Entity: LABEL_1, Start:11, End:14, word:the
Entity: LABEL_1, Start:15, End:19, word:last
Entity: LABEL_0, Start:20, End:28, word:sentence
Entity: LABEL_1, Start:29, End:30, word:“
Entity: LABEL_1, Start:30, End:38, word:scanning
Entity: LABEL_1, Start:39, End:44, word:probe
Entity: LABEL_1, Start:45, End:46, word:.
Entity: LABEL_1, Start:46, End:47, word:.
Entity: LABEL_0, Start:47, End:48, word:.
Entity: LABEL_1, Start:49, End:59, word:perovskite
Entity: LABEL_0, Start:60, End:66, word:family
Entity: LABEL_1, Start:66, End:67, word:”
Entity: LABEL_0, Start:67, End:68, word:.
Entity: LABEL_0, Start:69, End:72, word:The
Entity: LABEL_1, Start:73, End:80, word:current
Entity: LABEL_1, Start:81, End:84, word:one
Entity: LABEL_1, Start:85, End:87, word:is
Entity: LABEL_1, Start:88, End:93, word:quite
Entity: LABEL_1, Start:94, End:103, word:confusing
Entity: LABEL_0, Start:103, End:104, word:.
The only subtle point to keep in mind here is that transformers tokenizers don't split sentences on whitespace; they break texts into subwords (in most cases!). Two consecutive #s always signify the start of a subword, and this property can be exploited to reconstruct the original words.
With this in mind, an unprincipled implementation for converting the output of the pipeline into Doccano format would look like this:
UPDATE: adding a new argument original_text to the function helps with reconstructing the text and getting rid of the extra spaces.
text = "3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing."
output = [{'entity': 'LABEL_1', 'score': 0.99999774, 'index': 1, 'word': '3', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999979, 'index': 2, 'word': ')', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9999897, 'index': 3, 'word': 'rewrite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 4, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999962, 'index': 5, 'word': 'last', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999975, 'index': 6, 'word': 'sentence', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998623, 'index': 7, 'word': '“', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99997735, 'index': 8, 'word': 'scanning', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9941041, 'index': 9, 'word': 'probe', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.999994, 'index': 10, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999696, 'index': 11, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 12, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998647, 'index': 13, 'word': 'per', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999939, 'index': 14, 'word': '##ovsk', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.99999154, 'index': 15, 'word': '##ite', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999942, 'index': 16, 'word': 'family', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.9997022, 'index': 17, 'word': '”', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999929, 'index': 18, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999977, 'index': 19, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999076, 'index': 20, 'word': 'current', 'start': None, 'end': None}, 
{'entity': 'LABEL_1', 'score': 0.99996257, 'index': 21, 'word': 'one', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9169066, 'index': 22, 'word': 'is', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.6795164, 'index': 23, 'word': 'quite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.7315716, 'index': 24, 'word': 'conf', 'start': None, 'end': None}, {'entity': 'LABEL_9', 'score': 0.9067044, 'index': 25, 'word': '##using', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999925, 'index': 26, 'word': '.', 'start': None, 'end': None}]
def pipelineoutput_2_doccon(pipelineoutput, original_text, merge_labels=True):
    words = [item['word'] for item in pipelineoutput]
    labels = [item['entity'] for item in pipelineoutput]
    tags = []
    text = ''
    for idx, (w, l) in enumerate(zip(words, labels)):
        if w.startswith('##'):
            length = len(text)
            text += w[2:]
            tags.append([length, length + len(w[2:]), l, w])
        else:
            if text != '':
                text += ' '
            length = len(text)
            text += w
            tags.append([length, length + len(w), l, w])
        length += len(text)
    if not merge_labels:
        final_tags = []
        temp_text = original_text.lower()
        accumulated = 0
        for item in tags:
            start = temp_text.find(item[-1])
            end = start + len(item[-1])
            final_tags.append([accumulated + start, accumulated + end, item[-2]])
            temp_text = temp_text[end:]
            accumulated += end
        return {"text": original_text, "label": final_tags}
    else:
        merged_tags = []
        for idx in range(len(tags)):
            if tags[idx][1] == len(text) or text[tags[idx][1]] == ' ':
                merged_tags.append(tags[idx])
            else:
                tags[idx + 1][0] = tags[idx][0]
                tags[idx + 1][-1] = tags[idx][-1] + tags[idx + 1][-1].replace('##', '')
        final_tags = []
        temp_text = original_text.lower()
        accumulated = 0
        for item in merged_tags:
            start = temp_text.find(item[-1])
            end = start + len(item[-1])
            final_tags.append([accumulated + start, accumulated + end, item[-2]])
            temp_text = temp_text[end:]
            accumulated += end
        return {"text": original_text, "label": final_tags}

pipelineoutput_2_doccon(output, text, merge_labels=False)
output:
{'label': [[0, 1, 'LABEL_1'],
[1, 2, 'LABEL_1'],
[3, 10, 'LABEL_8'],
[11, 14, 'LABEL_1'],
[15, 19, 'LABEL_1'],
[20, 28, 'LABEL_1'],
[29, 30, 'LABEL_2'],
[30, 38, 'LABEL_2'],
[39, 44, 'LABEL_3'],
[45, 46, 'LABEL_1'],
[46, 47, 'LABEL_1'],
[47, 48, 'LABEL_1'],
[49, 59, 'LABEL_3'],
[60, 66, 'LABEL_3'],
[66, 67, 'LABEL_2'],
[67, 68, 'LABEL_1'],
[69, 72, 'LABEL_1'],
[73, 80, 'LABEL_1'],
[81, 84, 'LABEL_1'],
[85, 87, 'LABEL_8'],
[88, 93, 'LABEL_1'],
[94, 103, 'LABEL_9'],
[103, 104, 'LABEL_1']],
'text': '3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing.'}
As you may have noticed, label assignment happens at the subword level. If the sequence labeler has learned the task well, it should assign consistent labels to the subwords of a word; otherwise it is your choice to aggregate labels in whatever way best meets your goals.

remove duplicate dictionary python

I have a problem. I have a list like this:
[{'id': 34, 'questionid': 5, 'text': 'yes', 'score': 1}, {'id': 10, 'questionid': 5,
'text': 'test answer updated', 'score': 2}, {'id': 20, 'questionid': 5, 'text': 'no',
'score': 0}, {'id': 35, 'questionid': 5, 'text': 'yes', 'score': 1}]
and I want to remove entries that duplicate "questionid", "text" and "score". For example, in this case I want output like this:
[{'id': 34, 'questionid': 5, 'text': 'yes', 'score': 1}, {'id': 10, 'questionid': 5,
'text': 'test answer updated', 'score': 2}, {'id': 20, 'questionid': 5, 'text': 'no',
'score': 0}]
How can I get this output in python?
We could create a dictionary that uses the ("questionid", "text", "score") tuple as key and the dicts as values, and use it to check for duplicates in data:
from operator import itemgetter
out = {}
for d in data:
    key = itemgetter("questionid", "text", "score")(d)
    if key not in out:
        out[key] = d
out = list(out.values())
Output:
[{'id': 34, 'questionid': 5, 'text': 'yes', 'score': 1},
{'id': 10, 'questionid': 5, 'text': 'test answer updated', 'score': 2},
{'id': 20, 'questionid': 5, 'text': 'no', 'score': 0}]
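Since Python 3.7 dicts preserve insertion order, so the same idea fits in a single comprehension. One difference to be aware of: later duplicates overwrite earlier ones, so this variant keeps the last dict of each duplicate group rather than the first (a sketch, with the question's list named data):

```python
from operator import itemgetter

data = [
    {'id': 34, 'questionid': 5, 'text': 'yes', 'score': 1},
    {'id': 10, 'questionid': 5, 'text': 'test answer updated', 'score': 2},
    {'id': 20, 'questionid': 5, 'text': 'no', 'score': 0},
    {'id': 35, 'questionid': 5, 'text': 'yes', 'score': 1},
]

key = itemgetter('questionid', 'text', 'score')
# Later duplicates overwrite earlier ones under the same key, so the
# duplicate 'yes' row with id 34 is replaced by the one with id 35.
deduped = list({key(d): d for d in data}.values())
```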

How to extract particular values from a list of dictionaries?

I have a list of dictionaries which looks like this:
[{'Score': 0.9979117512702942, 'Type': 's_merchant', 'Text': 'merchants', 'BeginOffset': 7, 'EndOffset': 16}, {'Score': 0.9997400045394897, 'Type': 'metric', 'Text': 'number of errors', 'BeginOffset': 22, 'EndOffset': 38}, {'Score': 0.9984105825424194, 'Type': 'metric', 'Text': 'order rate', 'BeginOffset': 43, 'EndOffset': 53}, {'Score': 0.997801661491394, 'Type': 'user_service', 'Text': 'search requests', 'BeginOffset': 57, 'EndOffset': 72}, {'Score': 0.999964714050293, 'Type': 'PROPERTY', 'Text': 'revenue', 'BeginOffset': 20, 'EndOffset': 27}, {'Score': 0.999964714050293, 'Type': 'PROPERTY_VAL', 'Text': 'gold', 'BeginOffset': 28, 'EndOffset': 32}, {'Score': 0.9646918177604675, 'Type': 'ORGANIZATION', 'Text': 'Gymshark', 'BeginOffset': 22, 'EndOffset': 30}]
I need to extract all the values of the keys 'Type' (which is 's_merchant' for the first dictionary) and 'Text' (which is 'merchants' for the first dictionary) from all the dictionaries in the list.
The output should be a list, something like this:
Type=['s_merchant','metric','user_service','PROPERTY','PROPERTY_VAL','ORGANIZATION']
Text=['merchants','number of errors','order rate','revenue','gold','Gymshark']
Is there a function/method to accomplish this?
Appreciate the help.
You can use Python's list comprehensions, which allow more compact syntax than a regular loop:
l = [{'Score': 0.9979117512702942, 'Type': 's_merchant', 'Text': 'merchants', 'BeginOffset': 7, 'EndOffset': 16}, {'Score': 0.9997400045394897, 'Type': 'metric', 'Text': 'number of errors', 'BeginOffset': 22, 'EndOffset': 38}, {'Score': 0.9984105825424194, 'Type': 'metric', 'Text': 'order rate', 'BeginOffset': 43, 'EndOffset': 53}, {'Score': 0.997801661491394, 'Type': 'user_service', 'Text': 'search requests', 'BeginOffset': 57, 'EndOffset': 72}, {'Score': 0.999964714050293, 'Type': 'PROPERTY', 'Text': 'revenue', 'BeginOffset': 20, 'EndOffset': 27}, {'Score': 0.999964714050293, 'Type': 'PROPERTY_VAL', 'Text': 'gold', 'BeginOffset': 28, 'EndOffset': 32}, {'Score': 0.9646918177604675, 'Type': 'ORGANIZATION', 'Text': 'Gymshark', 'BeginOffset': 22, 'EndOffset': 30}]
Type = [i['Type'] for i in l]
Text = [i['Text'] for i in l]
To remove duplicate values from the list, a good option is to use a set, like:
list(set(Type))
With your example, just do:
Type = list(set([i['Type'] for i in l]))
Type = []
Text = []
for s in list_dicts:
    Type.append(s['Type'])
    Text.append(s['Text'])
Or with less code using list comprehensions (though it is essentially the same thing):
Type = [s['Type'] for s in list_dicts]
Text = [s['Text'] for s in list_dicts]
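The two loops can also be collapsed into one pass with zip, which transposes the list of (Type, Text) pairs into two sequences (a sketch over a shortened version of the list):

```python
l = [
    {'Type': 's_merchant', 'Text': 'merchants'},
    {'Type': 'metric', 'Text': 'number of errors'},
    {'Type': 'ORGANIZATION', 'Text': 'Gymshark'},
]

# zip(*pairs) transposes the (Type, Text) pairs into two tuples,
# which are then converted back to lists.
Type, Text = zip(*[(d['Type'], d['Text']) for d in l])
Type, Text = list(Type), list(Text)
```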

How can I add the dictionary into a list using the append function or another function?

Excuse me, I need your help!
Code Script
tracks_ = []
track = {}
if category == 'reference':
    for i in range(len(tracks)):
        if len(tracks) >= 1:
            _track = tracks[i]
            track['id'] = _track['id']
            tracks_.append(track)
print(tracks_)
tracks File
[{'id': 345, 'mode': 'ghost', 'missed': 27, 'box': [0.493, 0.779, 0.595, 0.808], 'score': 89, 'class': 1, 'time': 3352}, {'id': 347, 'mode': 'ghost', 'missed': 9, 'box': [0.508, 0.957, 0.631, 0.996], 'score': 89, 'class': 1, 'time': 5463}, {'id': 914, 'mode': 'track', 'missed': 0, 'box': [0.699, 0.496, 0.991, 0.581], 'score': 87, 'class': 62, 'time': 6549}, {'id': 153, 'mode': 'track', 'missed': 0, 'box': [0.613, 0.599, 0.88, 0.689], 'score': 73, 'class': 62, 'time': 6549}, {'id': 588, 'mode': 'track', 'missed': 0, 'box': [0.651, 0.685, 0.958, 0.775], 'score': 79, 'class': 62, 'time': 6549}, {'id': 972, 'mode': 'track', 'missed': 0, 'box': [0.632, 0.04, 0.919, 0.126], 'score': 89, 'class': 62, 'time': 6549}, {'id': 300, 'mode': 'ghost', 'missed': 6, 'box': [0.591, 0.457, 0.74, 0.498], 'score': 71, 'class': 62, 'time': 5716}]
Based on the code and the input above, I want to print out tracks_, and the result is
[{'id': 300}, {'id': 300}, {'id': 300}, {'id': 300}, {'id': 300}, {'id': 300}, {'id': 300}]
but the result that should be printed out is this:
[{'id': 345}, {'id': 347}, {'id': 914}, {'id': 153}, {'id': 588}, {'id': 972}, {'id': 300}]
You are appending the same dict track to your list tracks_ on every iteration, so the list holds multiple references to a single dict; practically, you have only one dict in your list, and any modification to track is reflected in all elements of the list. To fix it, create a new dict on each iteration:
if category == 'reference' and len(tracks) >= 1:
    for d in tracks:
        tracks_.append({'id': d['id']})
You could use a list comprehension:
tracks_ = [{'id': t['id']} for t in tracks]
tracks_
output:
[{'id': 345},
{'id': 347},
{'id': 914},
{'id': 153},
{'id': 588},
{'id': 972},
{'id': 300}]
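The aliasing problem described above is easy to reproduce in isolation; every append stores a reference to the same dict object, so all list elements end up showing the last value written (a minimal sketch with a shortened tracks list):

```python
tracks = [{'id': 345}, {'id': 347}, {'id': 300}]

# Buggy version: every append stores a reference to the SAME dict,
# so each element reflects the last id written into it.
track = {}
shared = []
for t in tracks:
    track['id'] = t['id']
    shared.append(track)

# Fixed version: a fresh dict is created on every iteration.
fresh = [{'id': t['id']} for t in tracks]
```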
