I am planning to train a custom Spark NLP NER model, which expects its training data in the CoNLL 2003 format (this blog even provides some sample training data to speed things up). That sample data is NOT useful to me, as I have my own training data; however, it consists of a list of spaCy Doc objects and, quite honestly, I don't know how to carry out the conversion. I have found three approaches so far, each with a considerable weakness:
In spaCy's documentation, I found example code showing how to convert a SINGLE Doc to CoNLL using the spacy_conll project, but it uses a blank spaCy model, so it is not clear where "my own labeled data" comes into play; furthermore, the conll_formatter component is "added at the end of the pipeline", so it seems that no direct conversion from Doc to CoNLL is actually done... Is my understanding correct?
In the Prodigy forum (another product by the makers of spaCy), I found this proposal; however, that "CoNLL" (2003, I suppose?) format seems to be incomplete: the POS tag is missing (which can easily be obtained via Token.pos_), as is the "syntactic chunk" tag (for which there does not seem to be a spaCy equivalent). All four fields are described in the official CoNLL 2003 documentation.
Speaking of a "direct conversion from Doc to CoNLL", I also found this implementation based on the textacy library, but it seems to have been deprecated as of version 0.11.0 because "CONLL-U [...] wasn't enforced or guaranteed", so I am not sure whether to use it or not (BTW, the most recent textacy release at the time of writing is 0.12.0).
My current code looks like:
import spacy
from spacy.training import offsets_to_biluo_tags
from spacy.tokens import Span
print("SPACY HELPER MODEL")
base_model = "en_core_web_sm"
nlp = spacy.load(base_model)
to_disable= ['parser', 'lemmatizer', 'ner']
_ = [nlp.remove_pipe(item) for item in to_disable]
print("Base model used: ", base_model)
print("Removed components: ", to_disable)
print("Enabled components: ", nlp.pipe_names)
# Assume text is already available as sentences...
# so no need for spaCy `sentencizer` or similar
print("\nDEMO SPACY DOC LIST BUILDING...", end="")
doc1 = nlp("iPhone X is coming.")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
doc2 = nlp("Space X is nice.")
doc2.ents = [Span(doc2, 0, 2, label="BRAND")]  # the span must be built from doc2, not doc1
docs = [doc1, doc2]
print("DONE!")
print("\nCoNLL 2003 CONVERSION:\n")
results = []
for doc in docs:
    # Preliminary: whole sentence
    whole_sentence = doc.text
    # 1st item (CoNLL 2003): word
    words = [token.text for token in doc]
    # 2nd item (CoNLL 2003): POS
    pos = [token.tag_ for token in doc]
    # 3rd item (CoNLL 2003): syntactic chunk tag
    sct = ["[UNKNOWN]" for token in doc]
    # 4th item (CoNLL 2003): named entities
    spacy_entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]
    biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
    results.append((whole_sentence, words, pos, sct, biluo_entities))
for result in results:
    print(
        "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
        result[0], "\n"
    )
    print("-DOCSTART- -X- -X- O")
    for w, x, y, z in zip(result[1], result[2], result[3], result[4]):
        print(w, x, y, z)
# Pending: write to a file, but that's easy, and out of topic.
Which gives as output:
DOC TEXT (NOT included in CoNLL 2003, just for demo): iPhone X is coming.
-DOCSTART- -X- -X- O
iPhone NNP [UNKNOWN] B-GADGET
X NNP [UNKNOWN] L-GADGET
is VBZ [UNKNOWN] O
coming VBG [UNKNOWN] O
. . [UNKNOWN] O
DOC TEXT (NOT included in CoNLL 2003, just for demo): Space X is nice.
-DOCSTART- -X- -X- O
Space NNP [UNKNOWN] B-BRAND
X NNP [UNKNOWN] L-BRAND
is VBZ [UNKNOWN] O
nice JJ [UNKNOWN] O
. . [UNKNOWN] O
Have you done something like this before?
Thanks!
If you look at a sample CoNLL file, you'll see they just separate entries with one blank line between them. So you just use a for loop.
for doc in docs:
    for sent in doc.sents:
        print("#", doc)  # optional but makes it easier to read
        print(sent._.conll_str)
        print()
CoNLL files are split by sentence, not spaCy Doc, but if you don't have sentence boundaries you can just loop over docs. There also seems to be an option to turn on headers directly in the component, see their README.
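The blank-line convention described above can also be sketched without any library. This is a minimal illustration, not the spacy_conll implementation: `format_conll` and the row layout (word, POS, chunk, NER tag) are hypothetical names chosen for the example, and the -DOCSTART- header mirrors the one printed in the question.

```python
def format_conll(sentences, docstart=True):
    """Render sentences as CoNLL-style text.

    `sentences` is a list of sentences; each sentence is a list of
    (word, pos, chunk, ner_tag) tuples. Blocks are separated by one
    blank line, as in a typical CoNLL file.
    """
    blocks = []
    if docstart:
        blocks.append("-DOCSTART- -X- -X- O")
    for rows in sentences:
        # One token per line, fields separated by a single space.
        blocks.append("\n".join(" ".join(row) for row in rows))
    return "\n\n".join(blocks) + "\n"


# Rows copied from the question's own output (illustrative only).
sentences = [
    [("iPhone", "NNP", "[UNKNOWN]", "B-GADGET"), ("X", "NNP", "[UNKNOWN]", "L-GADGET")],
    [("Space", "NNP", "[UNKNOWN]", "B-BRAND"), ("X", "NNP", "[UNKNOWN]", "L-BRAND")],
]
conll_text = format_conll(sentences)
print(conll_text)
```

Writing `conll_text` to a file is then a single `open(...).write(...)` call.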
Not sure whether this helps, but here is what I can add:
Spark NLP's NER won't use your POS tags, so you can simply fill them with placeholder values, which should simplify your work.
Check out the JSL Annotation Lab product. It lets you label data and integrates smoothly with Spark NLP NER. It's free.
With @AlbertoAndreotti's help, I managed to get a functional workaround:
import spacy
from spacy.training import offsets_to_biluo_tags
from spacy.tokens import Span
print("SPACY HELPER MODEL")
base_model = "en_core_web_sm"
nlp = spacy.load(base_model)
to_disable= ['parser', 'lemmatizer', 'ner']
_ = [nlp.remove_pipe(item) for item in to_disable]
print("Base model used: ", base_model)
print("Removed components: ", to_disable)
print("Enabled components: ", nlp.pipe_names)
# Assume text is already available as sentences...
# so no need for spaCy `sentencizer` or similar
print("\nDEMO SPACY DOC LIST BUILDING...", end="")
doc1 = nlp("iPhone X is coming.")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
doc2 = nlp("Space X is nice.")
doc2.ents = [Span(doc2, 0, 2, label="BRAND")]  # the span must be built from doc2, not doc1
docs = [doc1, doc2]
print("DONE!")
print("\nCoNLL 2003 CONVERSION:\n")
results = []
for doc in docs:
    # Preliminary: whole sentence
    whole_sentence = doc.text
    # 1st item (CoNLL 2003): word
    words = [token.text for token in doc]
    # 2nd item (CoNLL 2003): POS
    pos = [token.tag_ for token in doc]
    # 3rd item (CoNLL 2003): syntactic chunk tag
    # sct = pos # Redundant, so will be left out
    # 4th item (CoNLL 2003): named entities
    spacy_entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]
    biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
    results.append((whole_sentence, words, pos, biluo_entities))
for result in results:
    print(
        "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
        result[0], "\n"
    )
    print("-DOCSTART- -X- -X- O")
    for w, x, y, z in zip(result[1], result[2], result[2], result[3]):
        print(w, x, y, z)
As complementary information, I found out that the missing 3rd item, the "syntactic chunk tag", is related to a broader problem called "phrase chunking", for which, as far as I can tell, only approximate solutions exist, so regardless of the library used, the conversion of that 3rd item specifically into CoNLL 2003 might contain errors. However, it seems Spark NLP does not care at all about the 2nd and 3rd items, so the workaround suggested here (reusing the POS tag as the 3rd column) is acceptable.
For more details, you might want to have a look at this thread.
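One caveat with the tags produced by the code above: offsets_to_biluo_tags emits BILUO tags (note the L- prefix in the question's output), while CoNLL 2003 files conventionally use IOB-style tags with only B-, I- and O. Assuming the downstream reader expects the latter, a minimal pure-Python sketch of the conversion could look like this; `biluo_to_iob2` is a hypothetical helper name, not part of spaCy or Spark NLP.

```python
def biluo_to_iob2(tags):
    """Map BILUO tags to IOB2: L-X becomes I-X and U-X becomes B-X."""
    prefix_map = {"L": "I", "U": "B"}
    converted = []
    for tag in tags:
        # "O" (outside) and "-" (misaligned token) carry no label; keep as-is.
        if "-" not in tag or tag == "-":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        converted.append(f"{prefix_map.get(prefix, prefix)}-{label}")
    return converted


biluo = ["B-GADGET", "L-GADGET", "O", "O", "O"]
iob = biluo_to_iob2(biluo)
print(iob)
```

The converted list can be zipped with the words and POS tags in place of `biluo_entities`.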
I've got a list of company names that I am trying to parse from a large number of PDF documents.
I've forced the PDFs through Apache Tika to extract the raw text, and I've got the list of 200 companies read in.
I'm stuck trying to use some combination of FuzzyWuzzy and spaCy to extract the required matches.
This is as far as I've gotten:
import spacy
from fuzzywuzzy import fuzz, process
nlp = spacy.load("en_core_web_sm")
doc = nlp(strings[1])
companies = []
candidates = []
for ent in doc.ents:
    if ent.label_ == "ORG":
        candidates.append(ent.text)
process.extractBests(company_name, candidates, score_cutoff=80)
What I'm trying to do is:
Read through the document string
Parse for any fuzzy company name matches scoring, say, 80+
Return company names that are contained in the document and their scores.
Help!
This is the way I populated candidates -- mpg is a Pandas DataFrame:
for s in mpg['name'].values:
    doc = nlp(s)
    for ent in doc.ents:
        if ent.label_ == 'ORG':
            candidates.append(ent.text)
Then let's say we have a short list of car data just to test with:
candidates = ['buick'
,'buick skylark'
,'buick estate wagon'
,'buick century']
The below method uses fuzz.token_sort_ratio which is described as "returning a measure of the sequences' similarity between 0 and 100 but sorting the token before comparing." Try out some of the ones partially documented here: https://github.com/seatgeek/fuzzywuzzy/issues/137
results = {} # dictionary to store results
companies = ['buick'] # you'll have more companies
for company in companies:
    results[company] = process.extractBests(company, candidates,
                                            scorer=fuzz.token_sort_ratio,
                                            score_cutoff=50)
And the results are:
In [53]: results
Out[53]: {'buick': [('buick', 100),
('buick skylark', 56),
('buick century', 56)]}
In this case using 80 as a cutoff score would work better than 50.
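To make the scorer less of a black box, here is a rough standard-library sketch of what a token-sort comparison does: lowercase and sort the tokens of both strings, then compare the results. It uses difflib.SequenceMatcher as a stand-in for FuzzyWuzzy, so scores are not guaranteed to match fuzz.token_sort_ratio; `token_sort_score` and `best_matches` are hypothetical names for this example.

```python
from difflib import SequenceMatcher


def token_sort_score(a, b):
    """Score 0-100 after lowercasing and sorting whitespace-separated tokens."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())


def best_matches(query, choices, score_cutoff=0):
    """Return (choice, score) pairs at or above the cutoff, best first."""
    scored = [(choice, token_sort_score(query, choice)) for choice in choices]
    kept = [pair for pair in scored if pair[1] >= score_cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


candidates = ["buick", "buick skylark", "buick estate wagon", "buick century"]
matches = best_matches("buick", candidates, score_cutoff=50)
print(matches)
```

Because sorting happens before comparison, word order in the company name does not affect the score.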
I want to know if there is an elegant way to get the index of an Entity with respect to a Sentence. I know I can get the index of an Entity in a string using ent.start_char and ent.end_char, but that value is with respect to the entire string.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion. Apple just launched a new Credit Card.")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
I want the Entity Apple in both the sentences to point to start and end indexes 0 and 5 respectively. How can I do that?
You need to subtract the sentence start position from the entity start positions:
for ent in doc.ents:
    # Subtracting ent.sent.start_char makes both offsets relative to the sentence
    print(ent.text, ent.start_char - ent.sent.start_char, ent.end_char - ent.sent.start_char, ent.label_)
Output:
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
Apple 0 5 ORG
Credit Card 26 37 ORG
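The arithmetic behind those numbers can be checked with plain string operations. In this sketch the sentence and entity positions are located by string search, as hypothetical stand-ins for `ent.sent.start_char`, `ent.start_char` and `ent.end_char`; no spaCy model is involved.

```python
text = ("Apple is looking at buying U.K. startup for $1 billion. "
        "Apple just launched a new Credit Card.")

# Stand-ins for the spaCy attributes, found by searching the raw text.
sent_start = text.index("Apple", 1)        # where the second sentence begins
ent_start = text.index("Credit Card")      # absolute entity start
ent_end = ent_start + len("Credit Card")   # absolute entity end

# Sentence-relative offsets: subtract the sentence start from both ends.
relative = (ent_start - sent_start, ent_end - sent_start)
sentence = text[sent_start:]
print(relative, sentence[relative[0]:relative[1]])
```

Slicing the sentence with the relative offsets recovers the entity text, which is a quick sanity check for this kind of conversion.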
I am trying to do:
Tokenize sentences from text
Compute Named Entities for each word present in sentence
This is what I have done so far:
nlp = spacy.load('en')
sentence = "Germany and U.S.A are popular countries. I am going to gym tonight"
sentence = nlp(sentence)
tokenized_sentences = []
for sent in sentence.sents:
    tokenized_sentences.append(sent)
for s in tokenized_sentences:
    labels = [ent.label_ for ent in s.ents]
    entities = [ent.text for ent in s.ents]
Error:
labels = [ent.label_ for ent in s.ents]
AttributeError: 'spacy.tokens.span.Span' object has no attribute 'ents'
Is there any alternative way to find named entities of tokenized sentence?
Thanks in advance
Note that you only have two entities - USA and Germany.
The simple version:
sentence = nlp("Germany and U.S.A are popular countries. I am going to gym tonight")
for ent in sentence.ents:
    print(ent.text, ent.label_)
What I think you are trying to do:
sentence = nlp("Germany and U.S.A are popular countries. I am going to gym tonight")
for sent in sentence.sents:
    tmp = nlp(str(sent))
    for ent in tmp.ents:
        print(ent.text, ent.label_)
ents is available on the Doc object (spacy.tokens.doc.Doc) that you get from doc = nlp(text).
sent is of type spacy.tokens.span.Span, which, as the error above shows, has no ents attribute.
Convert it to text and run nlp() on it again:
print([(ent.text, ent.label_) for ent in nlp(sent.text).ents])
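Re-running nlp() on every sentence processes the text twice, and the model may label a sentence differently when it sees it out of context. An alternative sketch, assuming you already have character offsets for sentences and entities from a single pass over the whole text, is to assign each entity to the sentence that contains it. The offsets and labels below are written by hand for illustration, not produced by a model, and `group_entities_by_sentence` is a hypothetical helper.

```python
def group_entities_by_sentence(sentences, entities):
    """Assign each (start, end, label) entity to the sentence span containing it."""
    grouped = [[] for _ in sentences]
    for start, end, label in entities:
        for i, (sent_start, sent_end) in enumerate(sentences):
            if sent_start <= start and end <= sent_end:
                grouped[i].append((start, end, label))
                break
    return grouped


text = "Germany and U.S.A are popular countries. I am going to gym tonight"
# Hand-computed sentence boundaries: the first sentence ends after "countries."
first_end = text.index(".", text.index("countries")) + 1
sentences = [(0, first_end), (first_end + 1, len(text))]
entities = [(0, 7, "GPE"), (12, 17, "GPE")]  # hand-written: Germany, U.S.A

grouped = group_entities_by_sentence(sentences, entities)
for (s_start, s_end), ents in zip(sentences, grouped):
    print(repr(text[s_start:s_end]), [(text[a:b], label) for a, b, label in ents])
```

The second sentence ends up with an empty list, matching the note above that the text contains only two entities.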
I want to extract all bigrams and trigrams of the given sentences.
from gensim.models import Phrases
documents = ["the mayor of new york was there", "Human Computer Interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
trigram = Phrases(bigram(sentence_stream, min_count=1, threshold=2, delimiter=b' '))
for sent in sentence_stream:
    #print(sent)
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]
    print(bigrams_)
    print(trigrams_)
The code works fine for bigrams and captures 'new york' and 'machine learning' as bigrams.
However, I get the following error when I try to insert trigrams.
TypeError: 'Phrases' object is not callable
Please let me know, how to correct my code.
I am following the example documentation of gensim.
According to the docs, you can do:
from gensim.models import Phrases
from gensim.models.phrases import Phraser
phrases = Phrases(sentence_stream)
bigram = Phraser(phrases)
trigram = Phrases(bigram[sentence_stream])
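In your code, bigram is a Phrases object, which is not callable, so bigram(sentence_stream, ...) raises the TypeError; apply it with square brackets instead, as in bigram[sentence_stream], and pass min_count, threshold and delimiter to the outer Phrases(...) call.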
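Note that Phrases only merges token pairs that co-occur often enough to pass the threshold; it does not enumerate every n-gram. If the goal is literally all contiguous bigrams and trigrams of each sentence, a plain-Python sketch (no gensim involved; `ngrams` is a hypothetical helper) would be:

```python
def ngrams(tokens, n):
    """Return every contiguous n-token window as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "new york mayor was present".split(" ")
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

This returns every window regardless of frequency, so it complements rather than replaces the phrase detection shown above.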