Convert spaCy `Doc` into CoNLL 2003 sample - python

I was planning to train a Spark NLP custom NER model, which uses the CoNLL 2003 format to do so (this blog even provides some training sample data to speed up the follow-up). That "sample data" is of no use to me, because I have my own training data to train a model with; this data, however, consists of a list of spaCy Doc objects and, quite honestly, I don't know how to carry on with the conversion. I have found three approaches so far, each with a considerable weakness:
In spaCy's documentation, I found example code for converting a SINGLE Doc to CoNLL using the spacy_conll project, but notice it uses a blank spaCy model, so it is not clear where "my own labeled data" comes into play; furthermore, the conll_formatter component seems to be "added at the end of the pipeline", so it seems "no direct conversion from Doc to CoNLL is actually done"... Is my understanding correct?
In the Prodigy forum (another product by the same designers as spaCy), I found this proposal; however, that "CoNLL" format (2003, I suppose?) seems to be incomplete: the POS tag seems to be missing (which can easily be obtained via Token.pos_), as well as the "syntactic chunk" (for which no spaCy equivalent seems to exist). These four fields are mentioned in the official CoNLL 2003 documentation.
Speaking of a "direct conversion from Doc to CoNLL", I also found this implementation based on the textacy library, but it seems it was deprecated in version 0.11.0 because "CONLL-U [...] wasn't enforced or guaranteed", so I am not sure whether to use it or not (BTW, the most up-to-date textacy release at the time of writing is 0.12.0).
My current code looks like:
import spacy
from spacy.training import offsets_to_biluo_tags
from spacy.tokens import Span
print("SPACY HELPER MODEL")
base_model = "en_core_web_sm"
nlp = spacy.load(base_model)
to_disable= ['parser', 'lemmatizer', 'ner']
_ = [nlp.remove_pipe(item) for item in to_disable]
print("Base model used: ", base_model)
print("Removed components: ", to_disable)
print("Enabled components: ", nlp.pipe_names)
# Assume text is already available as sentences...
# so no need for spaCy `sentencizer` or similar
print("\nDEMO SPACY DOC LIST BUILDING...", end="")
doc1 = nlp("iPhone X is coming.")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
doc2 = nlp("Space X is nice.")
doc2.ents = [Span(doc2, 0, 2, label="BRAND")]
docs = [doc1, doc2]
print("DONE!")
print("\nCoNLL 2003 CONVERSION:\n")
results = []
for doc in docs:
    # Preliminary: whole sentence
    whole_sentence = doc.text
    # 1st item (CoNLL 2003): word
    words = [token.text for token in doc]
    # 2nd item (CoNLL 2003): POS
    pos = [token.tag_ for token in doc]
    # 3rd item (CoNLL 2003): syntactic chunk tag
    sct = ["[UNKNOWN]" for token in doc]
    # 4th item (CoNLL 2003): named entities
    spacy_entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]
    biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
    results.append((whole_sentence, words, pos, sct, biluo_entities))
for result in results:
    print(
        "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
        result[0], "\n"
    )
    print("-DOCSTART- -X- -X- O")
    for w, x, y, z in zip(result[1], result[2], result[3], result[4]):
        print(w, x, y, z)
# Pending: write to a file, but that's easy, and out of topic.
Which gives as output:
DOC TEXT (NOT included in CoNLL 2003, just for demo): iPhone X is coming.
-DOCSTART- -X- -X- O
iPhone NNP [UNKNOWN] B-GADGET
X NNP [UNKNOWN] L-GADGET
is VBZ [UNKNOWN] O
coming VBG [UNKNOWN] O
. . [UNKNOWN] O
DOC TEXT (NOT included in CoNLL 2003, just for demo): Space X is nice.
-DOCSTART- -X- -X- O
Space NNP [UNKNOWN] B-BRAND
X NNP [UNKNOWN] L-BRAND
is VBZ [UNKNOWN] O
nice JJ [UNKNOWN] O
. . [UNKNOWN] O
Have you done something like this before?
Thanks!

If you look at a sample CoNLL file, you'll see they just separate entries with one blank line between them. So you just use a for loop.
for doc in docs:
    for sent in doc.sents:
        print("#", doc)  # optional but makes it easier to read
        print(sent._.conll_str)
        print()
CoNLL files are split by sentence, not spaCy Doc, but if you don't have sentence boundaries you can just loop over docs. There also seems to be an option to turn on headers directly in the component, see their README.
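For reference, a minimal end-to-end sketch of what that loop assumes (the "conll_formatter" component and the ._.conll_str extension come from the spacy_conll package mentioned above; whether an extra config entry is needed to emit CoNLL headers is something to check in their README):
import spacy
import spacy_conll  # importing the package registers the "conll_formatter" factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("conll_formatter", last=True)

doc = nlp("iPhone X is coming. Space X is nice.")
for sent in doc.sents:
    print("#", sent.text)    # optional comment line, as in the loop above
    print(sent._.conll_str)  # CoNLL string produced by the conll_formatter component
    print()                  # blank line separates sentences in CoNLL files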

Not sure if this helps or not, but here's what I can add:
Spark-NLP NER won't use your POS tags, so if you just fill them with foo-bar placeholder values, that should simplify your work.
Check out the JSL Annotation Lab product. It allows you to label data and it integrates smoothly with Spark-NLP NER. It's free.
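To make the placeholder idea concrete, here is a rough sketch of how one might emit CoNLL 2003 lines while filling the POS and chunk columns with a dummy value (the -X- placeholder and the helper name are my own choices; the NER column reuses the BILUO tags from the question's code):
from spacy.training import offsets_to_biluo_tags

def doc_to_conll_lines(doc):
    # doc is a spaCy Doc whose .ents have already been set
    offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
    ner_tags = offsets_to_biluo_tags(doc, offsets)
    # POS and syntactic-chunk columns filled with a placeholder value
    return [f"{token.text} -X- -X- {tag}" for token, tag in zip(doc, ner_tags)]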

With @AlbertoAndreotti's help, I managed to put together a functional workaround:
import spacy
from spacy.training import offsets_to_biluo_tags
from spacy.tokens import Span
print("SPACY HELPER MODEL")
base_model = "en_core_web_sm"
nlp = spacy.load(base_model)
to_disable= ['parser', 'lemmatizer', 'ner']
_ = [nlp.remove_pipe(item) for item in to_disable]
print("Base model used: ", base_model)
print("Removed components: ", to_disable)
print("Enabled components: ", nlp.pipe_names)
# Assume text is already available as sentences...
# so no need for spaCy `sentencizer` or similar
print("\nDEMO SPACY DOC LIST BUILDING...", end="")
doc1 = nlp("iPhone X is coming.")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
doc2 = nlp("Space X is nice.")
doc2.ents = [Span(doc2, 0, 2, label="BRAND")]
docs = [doc1, doc2]
print("DONE!")
print("\nCoNLL 2003 CONVERSION:\n")
results = []
for doc in docs:
    # Preliminary: whole sentence
    whole_sentence = doc.text
    # 1st item (CoNLL 2003): word
    words = [token.text for token in doc]
    # 2nd item (CoNLL 2003): POS
    pos = [token.tag_ for token in doc]
    # 3rd item (CoNLL 2003): syntactic chunk tag
    # sct = pos # Redundant, so will be left out
    # 4th item (CoNLL 2003): named entities
    spacy_entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]
    biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
    results.append((whole_sentence, words, pos, biluo_entities))
for result in results:
    print(
        "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
        result[0], "\n"
    )
    print("-DOCSTART- -X- -X- O")
    # the POS column (result[2]) is reused for the syntactic chunk column,
    # since Spark NLP ignores both anyway
    for w, x, y, z in zip(result[1], result[2], result[2], result[3]):
        print(w, x, y, z)
As complementary information, I found out that the missing 3rd item, the "syntactic chunk tag", is related to a broader problem called "phrase chunking", which happens to be an unsolved problem in Computer Science for which only approximations exist, so regardless of the library used, the conversion of that 3rd item into CoNLL 2003 might contain errors. However, it seems Spark NLP does not care at all about the 2nd and 3rd items, so the workaround suggested here is acceptable.
For more details, you might want to keep an eye on this thread.
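For anyone who also needs the "write to a file" step that was left as pending above, a minimal sketch could look like this (the train.conll filename is my own choice, and the POS column is again reused for the ignored chunk column):
# results as built in the workaround above: (text, words, pos, biluo_tags) tuples
with open("train.conll", "w", encoding="utf-8") as f:
    for _, words, pos, biluo in results:
        f.write("-DOCSTART- -X- -X- O\n\n")
        for word, tag, ner in zip(words, pos, biluo):
            f.write(f"{word} {tag} {tag} {ner}\n")
        f.write("\n")  # blank line ends the sentence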

Related

Python NLP Spacy : improve bi-gram extraction from a dataframe, and with named entities?

I am using Python and spaCy as my NLP library, working on a big dataframe that contains feedback about different cars, which looks like this:
'feedback' column contains natural language text to be processed,
'lemmatized' column contains lemmatized version of the feedback text,
'entities' column contains named entities extracted from the feedback text (I've trained the pipeline so that it will recognise car models and brands, labelling these as 'CAR_BRAND' and 'CAR_MODEL')
I then created the following function, which applies the spaCy nlp_token pipeline to each row of my dataframe and extracts any [noun + verb], [verb + noun], [adj + noun], [adj + proper noun] combinations.
def bi_gram(x):
    doc = nlp_token(x)
    result = []
    text = ''
    for i in range(len(doc)):
        j = i + 1
        if j < len(doc):
            if ((doc[i].pos_ == "NOUN" and doc[j].pos_ == "VERB")
                    or (doc[i].pos_ == "VERB" and doc[j].pos_ == "NOUN")
                    or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "NOUN")
                    or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "PROPN")):
                text = doc[i].text + " " + doc[j].text
    result.append(text)
    i = i + 1
    return result
Then I applied this function to 'lemmatized' column:
df['bi_gram'] = df['lemmatized'].apply(bi_gram)
This is where I have a problem...
This is producing only one bigram per row maximum. How can I tweak the code so that more than one bigram can be extracted and put in a column? (Also are there more linguistic combinations I should try?)
Is there a way to find out what people are saying about the 'CAR_BRAND' and 'CAR_MODEL' named entities extracted into the 'entities' column? For example 'Cool Porsche'. Some brands or models are made of more than two words, so it's tricky to tackle.
I am very new to NLP. If there is a more efficient way to tackle this, any advice will be super helpful!
Many thanks for your help in advance.
spaCy has a built-in pattern matching engine that's perfect for your application – it's documented here and in a more extensive usage guide. It allows you to define patterns in a readable and easy-to-maintain way, as lists of dictionaries that define the properties of the tokens to be matched.
Set up the pattern matcher
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # or whatever model you choose
matcher = Matcher(nlp.vocab)

# your patterns
patterns = {
    "noun_verb": [{"POS": "NOUN"}, {"POS": "VERB"}],
    "verb_noun": [{"POS": "VERB"}, {"POS": "NOUN"}],
    "adj_noun": [{"POS": "ADJ"}, {"POS": "NOUN"}],
    "adj_propn": [{"POS": "ADJ"}, {"POS": "PROPN"}],
}

# add the patterns to the matcher
for pattern_name, pattern in patterns.items():
    matcher.add(pattern_name, [pattern])
Extract matches
doc = nlp("The dog chased cats. Fast cats usually escape dogs.")
matches = matcher(doc)
matches is a list of tuples containing
a match id,
the start index of the matched bit and
the end index (exclusive).
This is a test output adapted from the spaCy usage guide:
for match_id, start, end in matches:
    # Get string representation
    string_id = nlp.vocab.strings[match_id]
    # The matched span
    span = doc[start:end]
    print(repr(span.text))
    print(match_id, string_id, start, end)
    print()
Result
'dog chased'
1211260348777212867 noun_verb 1 3
'chased cats'
8748318984383740835 verb_noun 2 4
'Fast cats'
2526562708749592420 adj_noun 5 7
'escape dogs'
8748318984383740835 verb_noun 8 10
Some ideas for improvement
Named entity recognition should be able to detect multi-word expressions, so brand and/or model names that consist of more than one token shouldn't be an issue if everything is set up correctly.
Matching dependency patterns instead of linear patterns might slightly improve your results (see the sketch below).
That being said, what you're trying to do (a kind of sentiment analysis) is quite a difficult task that's normally tackled with machine learning approaches and heaps of training data. So don't expect too much from simple heuristics.
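To illustrate the dependency-pattern idea, here is a minimal sketch using spaCy's DependencyMatcher (the pattern name, the choice of the amod relation and the example output are mine, not part of the original answer):
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
dep_matcher = DependencyMatcher(nlp.vocab)
pattern = [
    # anchor token: a noun or proper noun
    {"RIGHT_ID": "target", "RIGHT_ATTRS": {"POS": {"IN": ["NOUN", "PROPN"]}}},
    # an adjectival modifier that is a direct child of the anchor
    {"LEFT_ID": "target", "REL_OP": ">", "RIGHT_ID": "modifier",
     "RIGHT_ATTRS": {"DEP": "amod"}},
]
dep_matcher.add("adj_modifies_noun", [pattern])

doc = nlp("The dog chased cats. Fast cats usually escape dogs.")
for match_id, token_ids in dep_matcher(doc):
    print([doc[i].text for i in token_ids])  # e.g. ['cats', 'Fast']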

How to exclude sentences from Spacy results if it contains a token with a specific dep_?

I would like to negatively filter spaCy results. Specifically, I would like to get sentences that include 'pobj' but not 'dobj' in the dependency parse. However, since sentences with 'dobj' are likely to also include 'pobj' (but not vice versa), spaCy also lists the sentences that contain 'dobj'.
For instance;
'He pushed the book off the shelf':
He nsubj
pushed ROOT
the det
book dobj
off prep
the det
shelf pobj
'The book fell off the table'
The det
book nsubj
fell ROOT
off prep
the det
table pobj
In both sentences, prep is the immediate head of pobj; therefore,
doc = nlp('He pushed the book off the shelf.The book fell off the table')
for t in doc:
    if t.dep_ == 'pobj':
        print(t.sent)
would give me both sentences in return. How can I filter negatively so that sentences containing both 'dobj' and 'pobj' are not listed, and only sentences containing just 'pobj' are?
Well, after many attempts, I found the following solution:
for a in doc:
    if a.dep_ == "prep" and a.pos_ == "ADP" and a.head.pos_ == "VERB":
        for b in a.head.children:
            if b.dep_ == "nsubj":
                sents = [t.sent for t in a.sent]
                for n in sents:
                    for c in n:
                        if c.dep_ == 'dobj':
                            pattern2_sents = [c.sent]
                        if c.dep_ != 'pobj':
                            pattern4_sents = [c.sent]
However, I am not sure why simply iterating with if token.dep_ != 'dobj' would not work in the original question.
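For comparison, a simpler per-sentence check (my own sketch, not part of the answer above) that keeps only sentences containing a 'pobj' and no 'dobj' would be:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("He pushed the book off the shelf. The book fell off the table.")

kept = []
for sent in doc.sents:
    deps = {t.dep_ for t in sent}
    if "pobj" in deps and "dobj" not in deps:
        kept.append(sent.text)

print(kept)  # expected: only the second sentence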

How to average the vector when merging (with retokenize) custom noun chunks in spaCy?

I am generating noun chunks using spaCy's statistical models (e.g. noun chunks based on the part-of-speech tags and dependencies) and rule-based matching to capture (user supplied) context specific noun chunks.
For other downstream tasks, I am retokenizing these noun chunks (spans), which works fine for the most part. However, the token's vector representation (token.vector) gets set to all zeros. Is there a way to retain the vector information e.g. by averaging the individual token vectors and assigning it to the retokenised token?
I tried this with the code...
def tokenise_noun_chunks(doc):
    if not doc.has_annotation("DEP"):
        return doc
    all_noun_chunks = list(doc.noun_chunks) + doc._.custom_noun_chunks
    with doc.retokenize() as retokenizer:
        for span in all_noun_chunks:
            # if I print(span.vector) here, I get the correctly averaged vector
            attrs = {"tag": span.root.tag, "dep": span.root.dep}
            retokenizer.merge(span, attrs=attrs)
    return doc
...but when I check the returned vectors for the noun chunks, I get a zeros array. I've modelled this (the above code) on the built-in merge_noun_chunks component (just modified to include my own custom noun chunks), so I can confirm that the built-in component gives the same results.
Is there any way to keep the word vector information? Will I need to add the term/average vector to the Vector store?
The retokenizer should set span.vector as the vector for the new merged token. With spacy==3.0.3 and en_core_web_md==3.0.0:
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("This is a sentence.")
with doc.retokenize() as retokenizer:
    for chunk in doc.noun_chunks:
        retokenizer.merge(chunk)
for token in doc:
    print(token, token.vector[:5])
Output:
This [-0.087595 0.35502 0.063868 0.29292 -0.23635 ]
is [-0.084961 0.502 0.0023823 -0.16755 0.30721 ]
a sentence [-0.093156 0.1371495 -0.307255 0.2993 0.1383735]
. [ 0.012001 0.20751 -0.12578 -0.59325 0.12525 ]
Attributes like tag and dep are also set to those of span.root by default, so you only need to specify them if you want to override the defaults.
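For completeness, a small sketch of overriding one of those defaults on a freshly parsed doc (the LEMMA override is only an illustration, not something the answer requires):
doc = nlp("This is a sentence.")
with doc.retokenize() as retokenizer:
    for chunk in doc.noun_chunks:
        # keep the default tag/dep taken from span.root, but set a custom lemma
        retokenizer.merge(chunk, attrs={"LEMMA": chunk.text.lower()})
print([(t.text, t.lemma_) for t in doc])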

Extracting sentence from a dataframe with description column based on a phrase

I have a dataframe with a 'description' column with details about the product. Each description in the column is a long paragraph, like:
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence which has the phrase "superb product", and place it in a new column?
So for this case the result will be
expected output
I have used this,
searched_words = ['superb product', 'SUPERB PRODUCT']
print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
The output for this is not suitable, though it works if I put just one word in the searched_words list.
There are a lot of methods to do that. @ChootsMagoots gave you a good answer, but spaCy is also very efficient: you can simply choose the pattern that will lead you to that sentence. Before that, though, you need to define a function that determines the sentence boundaries. Here's the code:
import spacy
from spacy.language import Language
from spacy.matcher import Matcher

@Language.component("product_sentencizer")
def product_sentencizer(doc):
    """Look for sentence start tokens by scanning for periods only."""
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False  # Tell the default sentencizer to ignore this token
    return doc

nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("product_sentencizer", before="parser")  # Insert before the parser can build its own sentences

text = ("This is a superb product. I so so loved this superb product that I wanna gift to all. "
        "This is like the quality and packaging. I like it very much.")
doc = nlp(text)

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "superb"}, {"LOWER": "product"}]  # two tokens, matched case-insensitively
matcher.add("SUPERB_PRODUCT", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)
Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:
for index, paragraph in df['column_name'].iteritems():
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            df['extracted_sentence'][index] = sentence
This is going to be quite slow, but idk if there's a better way.

Probability tree for sentences in nltk employing both lookahead and lookback dependencies

Does nltk or any other NLP tool allow constructing probability trees based on input sentences, thus storing the language model of the input text in a dictionary tree? The following example gives the rough idea, but I need the same functionality such that a word Wt is not just modelled probabilistically on past input words (history) Wt-n, but also on lookahead words like Wt+m. The lookback and lookahead word counts should also be 2 or more, i.e. bigrams or more. Are there any other libraries in Python which achieve this?
from collections import defaultdict
import nltk
import math

ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]  # a list, so it can be sliced below
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}
The solution requires both lookahead and lookback, and a specially subclassed dictionary may help in solving this problem. Pointers to relevant resources that discuss implementing such a system would also be welcome. nltk.models seemed to be doing something similar but is no longer available. Are there any existing design patterns in NLP which implement this idea? Skip-gram based models are similar to this idea too, but I feel this must have been implemented somewhere already.
If I understand your question correctly, you are looking for a way to predict the probability of a word given its surrounding context (not just backward context but also the forward context).
One quick hack for your purpose is to train two different language models, one from right to left and the other from left to right; the probability of a word given its context would then be the normalized sum of both the forward and backward contexts.
Extending your code:
from collections import defaultdict
import nltk
import numpy as np

ngram = defaultdict(lambda: defaultdict(int))
ngram_rev = defaultdict(lambda: defaultdict(int))  # reversed n-grams
corpus = "The cat is cute. He jumps and he is happy."

for sentence in nltk.sent_tokenize(corpus):
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
    for token, rev_token in zip(tokens[1:], tokens):
        ngram_rev[token][rev_token] += 1

for token in ngram:
    total = np.log(sum(ngram[token].values()))
    total_rev = np.log(sum(ngram_rev[token].values()))
    ngram[token] = {nxt: np.log(v) - total
                    for nxt, v in ngram[token].items()}
    ngram_rev[token] = {prv: np.log(v) - total_rev
                        for prv, v in ngram_rev[token].items()}
Now the context is stored in both ngram and ngram_rev, which respectively hold the forward and backward contexts.
You should also account for smoothing. That is, if a given phrase is not seen in your training corpus, you would just get zero probabilities. In order to avoid that, there are many smoothing techniques, the simplest of which is add-one (Laplace) smoothing.
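As a rough illustration, add-one smoothing on the forward counts could look like the sketch below; it operates on the raw counts, so it would replace the log-normalization loop above (the vocabulary construction is my own assumption, and the same would have to be done for ngram_rev):
# add-one (Laplace) smoothing over the raw forward bigram counts
vocab = set(ngram) | {w for nxts in ngram.values() for w in nxts}
V = len(vocab)

smoothed = {}
for token in ngram:
    total = sum(ngram[token].values()) + V  # +1 pseudo-count for every vocabulary word
    smoothed[token] = {nxt: np.log((ngram[token].get(nxt, 0) + 1) / total)
                       for nxt in vocab}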
The normal ngram algorithm traditionally works with prior context only, and for good reason: A bigram tagger makes decisions by considering the tags of the last two words, plus the current word. So unless you tag in two passes, the tag of the next word is not yet known. But you are interested in word ngrams, not tag ngrams, so nothing keeps you from training an ngram tagger where the ngram consists of words from both sides. And you can indeed do it easily with the NLTK.
The NLTK's ngram taggers all make tag ngrams, from the left; but you can easily derive your own tagger from their abstract base class, ContextTagger:
import nltk
from nltk.tag import ContextTagger

class TwoSidedTagger(ContextTagger):
    left = 2
    right = 1

    def context(self, tokens, index, history):
        left = self.left
        right = self.right
        tokens = tuple(t.lower() for t in tokens)
        if index < left:
            tokens = ("<start>",) * left + tokens
            index += left
        if index + right >= len(tokens):
            tokens = tokens + ("<end>",) * right
        return tokens[index - left:index + right + 1]
This defines a tetragram tagger (2+1+1) where the current word is third in the ngram, not last as usual. You can then initialize and train a tagger just like the regular ngram taggers (see chapter 5 of the NLTK book, especially sections 5.4ff). Let's see first how you'd build a part-of-speech tagger, using a portion of the Brown corpus as training data:
data = list(nltk.corpus.brown.tagged_sents(categories="news"))
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('NN'))
twosidedtagger._train(train_sents)
Like all ngram taggers in the NLTK, this one will delegate to the backoff tagger if it is asked to tag an ngram it did not see during training.
For simplicity I used a simple "default tagger" as the backoff tagger, but you'll probably need to use something more powerful (see the NLTK chapter again).
You can then use your tagger to tag new text, or evaluate it with an already tagged test set:
>>> print(twosidedtagger.tag("There were dogs everywhere .".split()))
>>> print(twosidedtagger.evaluate(test_sents))
Predicting words:
The above tagger assigns a POS tag by considering nearby words; but your goal is to predict the word itself, so you need different training data and a different default tagger. The NLTK API expects training data in the form (word, LABEL), where LABEL is the value you want to generate. In your case, LABEL is just the current word itself; so make your training data as follows:
data = [list(zip(s, s)) for s in nltk.corpus.brown.sents(categories="news")]
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('the'))  # most common word
twosidedtagger._train(train_sents)
It makes no sense for the target word to appear in the "context" ngram, so you should also modify the method context() so that the returned ngram does not include it:
def context(self, tokens, index, history):
    ...
    return tokens[index - left:index] + tokens[index + 1:index + right + 1]
This tagger uses trigrams consisting of two words from the left and one from the right of the current word.
With these modifications, you'll build a tagger that outputs the most likely word at any position. Try it and see how you like it.
Prediction:
My expectation is that you'll need a humongous amount of training data before you can get decent performance. The problem is that ngram taggers can only suggest a tag for contexts they saw during training.
To build a tagger that generalizes, consider using the NLTK to train a "sequential classifier". You can use whatever features you want, including the words before and after; of course, how well it will work is your problem. The NLTK classifier API is similar to that of the ContextTagger, but the context function (aka feature function) returns a dictionary, not a tuple. Again, see the NLTK book and the source code.
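To make that last suggestion concrete, here is a minimal sketch of a two-sided feature function plugged into nltk's ClassifierBasedTagger for the word-prediction setup above (the feature names, the small training slice, and relying on the default Naive Bayes classifier builder are my own choices; training on the full corpus would be slow and memory-hungry):
import nltk
from nltk.tag import ClassifierBasedTagger

def two_sided_features(tokens, index, history):
    # only the surrounding words, never the target word itself
    return {
        "prev": tokens[index - 1].lower() if index > 0 else "<start>",
        "prev2": tokens[index - 2].lower() if index > 1 else "<start>",
        "next": tokens[index + 1].lower() if index < len(tokens) - 1 else "<end>",
    }

data = [list(zip(s, s)) for s in nltk.corpus.brown.sents(categories="news")]
tagger = ClassifierBasedTagger(feature_detector=two_sided_features, train=data[400:3000])
print(tagger.tag("There were dogs everywhere .".split()))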
