multiple SpaCy doc objects and want to combine them into one object

multiple SpaCy doc objects and want to combine them into one object - python

I am trying to tokenize a text file of the King James bible but when I attempt it I get a memory error. So I have divided the text up into multiple objects. Now I want to use spaCy to tokenize the objects and then recombine them into one doc object. I have seen others talking about similar issues and converting into arrays and then back to a doc after combining the arrays. Will this work to fix my issue or create new issues later?
I tried running this but colab nor my computer do not have the RAM to support it.
nlp_spacy = spacy.load('en')
kjv_bible = gutenberg.raw('bible-kjv.txt')
#pattern for bracketed text titles
bracks = "[\[].*?[\]]"
kjv_bible = re.sub(bracks, "", kjv_bible)
kjv_bible = ' '.join(kjv_bible.split())
len(kjv_bible)
kjv_bible_doc = nlp_spacy(kjv_bible)
ValueError Traceback (most recent call
last)
<ipython-input-19-385936fadd40> in <module>()
----> 1 kjv_bible_doc = nlp_spacy(kjv_bible)
/usr/local/lib/python3.6/dist-packages/spacy/language.py in
__call__(self, text, disable, component_cfg)
378 if len(text) > self.max_length:
379 raise ValueError(
--> 380 Errors.E088.format(length=len(text),
max_length=self.max_length)
381 )
382 doc = self.make_doc(text)
ValueError: [E088] Text of length 4305663 exceeds maximum of 1000000.
The v2.x parser and NER models require roughly 1GB of temporary memory
per 100,000 characters in the input. This means long texts may cause
memory allocation errors. If you're not using the parser or NER, it's
probably safe to increase the `nlp.max_length` limit. The limit is in
number of characters, so you can check whether your inputs are too
long by checking `len(text)`.
nlp.max_length = 4305663
kjv_bible_doc = nlp_spacy(kjv_bible)
results in crashing the notebook due to RAM memory
will this work
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
np_array.extend(np_array2)
doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)

You can use the function Doc.from_docs to concatenate multiple Doc objects. Here is the example from the spaCy documentation:
from spacy.tokens import Doc
texts = ["London is the capital of the United Kingdom.",
"The River Thames flows through London.",
"The famous Tower Bridge crosses the River Thames."]
docs = list(nlp.pipe(texts))
c_doc = Doc.from_docs(docs)
assert str(c_doc) == " ".join(texts)
assert len(list(c_doc.sents)) == len(docs)
assert [str(ent) for ent in c_doc.ents] == \
[str(ent) for doc in docs for ent in doc.ents]

If you increase the max_length, it will crash unless you explicitly disable the components that use a lot of memory (parser and NER). If you're only using the tokenizer, you can disable everything except the tokenizer when you load the model:
nlp = spacy.load('en', disable=['tagger', 'parser', 'ner'])

Related

TypeError in torch.argmax() when want to find the tokens with the highest `start` score

I want to run this code for question answering using hugging face transformers.
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
#Model
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
#Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
question = '''Why was the student group called "the Methodists?"'''
paragraph = ''' The movement which would become The United Methodist Church began in the mid-18th century within the Church of England.
A small group of students, including John Wesley, Charles Wesley and George Whitefield, met on the Oxford University campus.
They focused on Bible study, methodical study of scripture and living a holy life.
Other students mocked them, saying they were the "Holy Club" and "the Methodists", being methodical and exceptionally detailed in their Bible study, opinions and disciplined lifestyle.
Eventually, the so-called Methodists started individual societies or classes for members of the Church of England who wanted to live a more religious life. '''
encoding = tokenizer.encode_plus(text=question,text_pair=paragraph)
inputs = encoding['input_ids'] #Token embeddings
sentence_embedding = encoding['token_type_ids'] #Segment embeddings
tokens = tokenizer.convert_ids_to_tokens(inputs) #input tokens
start_scores, end_scores = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))
start_index = torch.argmax(start_scores)
but I get this error at the last line:
Exception has occurred: TypeError
argmax(): argument 'input' (position 1) must be Tensor, not str
File "D:\bert\QuestionAnswering.py", line 33, in <module>
start_index = torch.argmax(start_scores)
I don't know what's wrong. can anyone help me?

BertForQuestionAnswering returns a QuestionAnsweringModelOutput object.
Since you set the output of BertForQuestionAnswering to start_scores, end_scores, the return QuestionAnsweringModelOutput object is forced convert to a tuple of strings ('start_logits', 'end_logits') causing the type mismatch error.
The following should work:
outputs = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))
start_index = torch.argmax(outputs.start_logits)

Huggingface transformers provide a simple high-level way of running the model, as shown in this guide:
from transformers import pipeline
nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)
print(nlp(question=question, context=paragraph, topk=5))
topk allows to select several top-scoring answers.

Python sklearn TfidfVectorizer: Vectorize documents ahead of query for semantic search

I want to run semantic search using TF-IDF.
This code works, but it is really slow when used on a large corpus of documents:
search_terms = "my query"
documents = ["my","list","of","docs"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform([search_terms] + documents)
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]
It seems quite inefficient:
Every new search query triggers a re-vectorizing of the entire corpus.
I am wondering how I can do the bulk work of vectorizing my corpus ahead of time, saving the result in an "index file". So that, when I run a query, the only thing left to do is to vectorize the few words from the query, and then to calculate similarity.
I tried vectorizing query and documents separately:
vec_docs = vectorizer.fit_transform(documents)
vec_query = vectorizer.fit_transform([search_terms])
cosine_similarities = linear_kernel(vec_query, vec_docs).flatten()
But it gives me this error:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 3 while Y.shape[1] == 260541
How can I run the corpus vectorization ahead of time without knowing what the query will be?
My main goal is to get blazing fast results even with a large corpus of documents (say, a few GB worth of text), even on a low-powered server, by doing the bulk of the data-crunching ahead of time.

TF/IDF vectors are high-dimensional and sparse. The basic data structure that supports that is an inverted index. You can either implement it yourself or use a standard index (e.g., Lucene).
Nevertheless, if you would like to experiment with modern deep-neural-based vector representations, check out the following semantic search demo. It uses a similarity search service that can handle billions of vectors.
(Note, I am a co-author of this demo.)

You almost have it right.
In this instance, you can get away with fitting (and transforming) your documents and only transforming your search terms. Here is your code, modified accordingly and using the twenty_newsgroups documents (11k) in its place. You can run it as a script and interactively verify you get fast results:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
news = fetch_20newsgroups()
search_terms = "my query"
# documents = ["my", "list", "of", "docs"]
documents = news.data
vectorizer = TfidfVectorizer()
# fit_transform does two things: fits the vectorizer and transforms documents
doc_vectors = vectorizer.fit_transform(documents)
# the vectorizer is already fit; just transform search_terms via vectorizer
search_term_vector = vectorizer.transform([search_terms])
cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()
if __name__ == "__main__":
while True:
query_str = input("\n\n\n\nquery string (return to quit): ")
if not query_str:
print("bye!")
break
search_term_vector = vectorizer.transform([query_str])
cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()
best_idx = np.argmax(cosine_similarities)
best_score = cosine_similarities[best_idx]
best_doc = documents[best_idx]
if best_score < 0.1:
print("no good matches")
else:
max_doc = documents[np.argmax(cosine_similarities)]
print(
f"Best match ({round(best_score, 4)}):\n\n", best_doc[0:200] + "...",
)
Example output:
query string (return to quit): protocol
Best match 0.239 (0.014 sec):
From: ethan#cs.columbia.edu (Ethan Solomita)
Subject: Re: X protocol packet type
Article-I.D.: cs.C52I2q.IFJ
Organization: Columbia University Department of Computer Science
Lines: 7
In article <9309...
Note: this algorithm find the best match(es) at best in O(n_documents) time, compared to Lucene (powers Elasticsearch) that uses skip lists that can search in O(log(n_documents)). Production search engines also have quiet a bit of tuning to optimize performance. The above could be useful with some tweaking but isn't going to topple Google tomorrow :)

Doc2Vec not providing adequate results in most_similar

I'm trying to use Doc2Vec to go through the classic exercise of training on Wikipedia articles, using the article title as the tag.
Here's my code and the results, is there something that I'm missing that they would not give the matching results with most_similar? Following this tutorial, but I used the wiki-english-20171001 dataset that came with gensim.
import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import re
def cleanText(text):
text = re.sub(r'\|\|\|', r' ', text)
text = re.sub(r'http\S+', r'<URL>', text)
text = text.lower()
text = re.sub(r'[^\w\s]','',text)
return text
wiki = api.load("wiki-english-20171001")
data = [d for d in wiki]
for i in range(10):
print(data[i])
def my_create_tagged_docs(data):
for wikiidx in range(len(data)):
yield TaggedDocument([i for i in data[wikiidx].get('section_texts') for i in cleanText(i).split()], [data[wikiidx].get('title')])
wiki_data = my_create_tagged_docs(data)
del data
del wiki
model = Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=19, iter =10, epochs=40)
model.build_vocab(wiki_data)
model.train(wiki_data, total_examples=model.corpus_count, epochs=model.epochs)
model.docvecs.most_similar(positive=["Lady Gaga"], topn=10)
[('Chlorothrix', 0.35521823167800903),
("A Child's Garden of Verses", 0.3533579707145691),
('Fish Mooney', 0.35129639506340027),
('2000 Paris–Roubaix', 0.3463437855243683),
('Calvin C. Chaffee', 0.3439667224884033),
('Murders of Eve Stratford and Lynne Weedon', 0.3397218585014343),
('Black Air', 0.3396576941013336),
('Turzyn', 0.3312540054321289),
('Scott Baker', 0.33018186688423157),
('Amongst the Waves', 0.3297169804573059)]
model.docvecs.most_similar(positive=["Machine learning"], topn=10)
[('Wolf Rock, Connecticut', 0.3855834901332855),
('Amália Rodrigues', 0.3349645137786865),
('Victoria Park, Leicester', 0.33312514424324036),
('List of visual anthropology films', 0.3311382532119751),
('Sadqay Teri Mout Tun', 0.3287636637687683),
('T. Damodaran', 0.32876330614089966),
('Urqu Jawira (Aroma)', 0.32281631231307983),
('Tiggy Wiggy', 0.3226730227470398),
('Frédéric Brun (cyclist, born 1988)', 0.32106447219848633),
('Unholy Crusade', 0.3200794756412506)]

It looks like your wiki_data is a single-pass generator, as returned by my_create_tagged_docs(), which can be iterated over only once - not an iterable object capable of many iterations, as the many steps of the Doc2Vec training requires.
You can test your wiki_data object for whether it's multiply-iterable, just after it's been assigned, by executing:
print(sum(1 for _ in wiki_data))
print(sum(1 for _ in wiki_data))
If you see the same number twice – the total number of documents – all's well. If the 2nd number is 0, you've created a single-use iterator instead of a multiple-use iterable.
As a result, the build_vocab() call will work to initialize the known-vocabulary & model – but then the train() will see an empty iterable, completing instantly with no real training happening. (If you run with logging at the INFO level, this may be obvious in the log timestamps for the various steps.)
Two possible fixes:
If you're lucky enough to have enough RAM to hold the whole corpus as Python objects, converting it into a in-memory list would ensure it's multiple-iterable:
wiki_data = list(my_create_tagged_docs(data))
But, most won't have that much RAM * shouldn't/needn't take that step. Instead, you can define a class for an iterable view on the data, which can return a fresh iterator every time it's needed. There's an example with further explanation in a blog post by the founder of the gensim project at:
https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/

Doc2Vec: tuTypeError: '<' not supported between instances of 'str' and 'int'

I created a doc2vec model to determine most similar documents :
Here the code for training :
#train doc2vec model
docs = g.doc2vec.TaggedLineDocument(train_corpus)
model = g.Doc2Vec(docs, dm=0, dbow_words=1, size=200, window=8, min_count=19, iter=2)
For ineference I try this :
#load model
m = g.Doc2Vec.load(model)
pprint(m.docvecs.most_similar(positive=["Machine learning"], topn=20))
But i Got this error :
TypeError Traceback (most recent call last)
<ipython-input-142-ca36e85d7a79> in <module>
----> 1 pprint(m.docvecs.most_similar(positive=["Machine learning"], topn=20))
~\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py in most_similar(self, positive, negative, topn, clip_start, clip_end, indexer)
1687 if isinstance(doc, ndarray):
1688 mean.append(weight * doc)
-> 1689 elif doc in self.doctags or doc < self.count:
1690 mean.append(weight * self.vectors_docs_norm[self._int_index(doc, self.doctags, self.max_rawint)])
1691 all_docs.add(self._int_index(doc, self.doctags, self.max_rawint))
TypeError: '<' not supported between instances of 'str' and 'int'
Any idea please?

It's a known bug pending a fix that if you supply a tag to doc2vec_model.docvecs.most_similar() that's not known to the model, it will show this confusing error.
So, "Machine learning" is not a tag that was supplied during training. In fact, the TaggedLineDocument class simply gives each document a single tag based on its line-number in the corpus file. If you want more sophisticated/descriptive tags, you'll have to prep the corpus yourself, to present individual objects (shaped like TaggedDocument) with both a list-of-words words property and a list-of-tags tags property.

I had the same error and i found a quick fix for it.
The reason you have this error is because your function : most_similar(positive=["Machine learning"] takes a list of tokens in the parameter positive so you need to give a list of words and not a sentence.
Here is a fix for you :
def process_query(query):
words = []
words = query.split()
return words
query = "machine learning"
l = process_query(query)
sim = model.most_similar(positive=l,topn=20)

Training IOB Chunker using nltk.tag.brill_trainer (Transformation-Based Learning)

I'm trying to train a specific chunker (let's say a noun chunker for simplicity) by using NLTK's brill module. I'd like to use three features, ie. word, POS-tag, IOB-tag.
(Ramshaw and Marcus, 1995:7) have shown 100 templates which are generated from the combination of those three features, for example,
W0, P0, T0 # current word, pos tag, iob tag
W-1, P0, T-1 # prev word, pos tag, prev iob tag
...
I want to incorporate them into nltk.tbl.feature, but there are only two kinds of feature objects, ie. brill.Word and brill.Pos. Limited by the design, I could only put word and POS feature together like (word, pos), and thus used ( (word, pos), iob) as features for training. For example,
from nltk.tbl import Template
from nltk.tag import brill, brill_trainer, untag
from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags, conlltags2tree
# Codes from (Perkins, 2013)
def train_brill_tagger(initial_tagger, train_sents, **kwargs):
templates = [
brill.Template(brill.Word([0])),
brill.Template(brill.Pos([-1])),
brill.Template(brill.Word([-1])),
brill.Template(brill.Word([0]),brill.Pos([-1])),]
trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates, trace=3,)
return trainer.train(train_sents, **kwargs)
# generating ((word, pos),iob) pairs as feature.
def chunk_trees2train_chunks(chunk_sents):
tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
return [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
>>> from nltk.tag import DefaultTagger
>>> tagger = DefaultTagger('NN')
>>> train = treebank_chunk.chunked_sents()[:2]
>>> t = chunk_trees2train_chunks(train)
>>> bt = train_brill_tagger(tagger, t)
TBL train (fast) (seqs: 2; tokens: 31; tpls: 4; min score: 2; min acc: None)
Finding initial useful rules...
Found 79 useful rules.
B |
S F r O | Score = Fixed - Broken
c i o t | R Fixed = num tags changed incorrect -> correct
o x k h | u Broken = num tags changed correct -> incorrect
r e e e | l Other = num tags changed incorrect -> incorrect
e d n r | e
------------------+-------------------------------------------------------
12 12 0 17 | NN->I-NP if Pos:NN#[-1]
3 3 0 0 | I-NP->O if Word:(',', ',')#[0]
2 2 0 0 | I-NP->B-NP if Word:('the', 'DT')#[0]
2 2 0 0 | I-NP->O if Word:('.', '.')#[0]
As shown above, (word, pos) are treated one feature as a whole. This is not a perfect capture of three features (word, pos-tag, iob-tag).
Any other ways to implement word, pos, iob features seperately into nltk.tbl.feature?
If it is impossible in NLTK, are there other implementations of them in python? I was only able to find C++ and Java implementations on the internet.

The nltk3 brill trainer api (I wrote it) does handle training on sequences of tokens described with multidimensional
features, as your data is an example of. However, the practical limits may be severe. The number of possible templates in multidimensional learning
increases drastically, and the current nltk implementation of the brill trainer trades memory
for speed, similar to Ramshaw and Marcus 1994, "Exploring the statistical derivation of transformation-rule sequences...".
Memory consumption may be HUGE and
it is very easy to give the system more data and/or templates than
it can handle. A useful strategy is to rank
templates according to how often they produce good rules (see
print_template_statistics() in the example below).
Usually, you can discard the lowest-scoring fraction (say 50-90%)
with little or no loss in performance and a major decrease in training time.
Another or additional possibility is to use the nltk
implementation of Brill's original algorithm, which has very different memory-speed tradeoffs; it does no indexing and so will use much less memory. It uses some optimizations and is actually rather quick in finding the very best rules, but is generally extremely slow towards end of training when there are many competing, low-scoring candidates. Sometimes you don't need those, anyway. For some reason this implementation seems to have been omitted from newer nltks, but here is the source (I just tested it) http://www.nltk.org/_modules/nltk/tag/brill_trainer_orig.html.
There are other algorithms with other tradeoffs, and
in particular the fast memory-efficient indexing algorithms of Florian and Ngai 2000
(http://www.aclweb.org/anthology/N/N01/N01-1006.pdf) and
probabilistic rule sampling of Samuel 1998
(https://www.aaai.org/Papers/FLAIRS/1998/FLAIRS98-045.pdf) would be a useful additions. Also, as you noticed, the documentation is not complete and too much focused on part-of-speech tagging, and it is not clear how to generalize from it. Fixing the docs is (also) on the todo list.
However, the interest for generalized (non-POS-tagging) tbl in nltk has been rather limited (the totally unsuited api of nltk2 was untouched for 10 years), so don't hold your breath. If you get impatient, you may wish to check out more dedicated alternatives,
in particular mutbl and fntbl (google them, I only have reputation for two links).
Anyway, here is a quick sketch for nltk:
First, a hardcoded convention in nltk is that tagged sequences ('tags' meaning any label
you would like to assign to your data, not necessarily part-of-speech) are represented
as sequences of pairs, [(token1, tag1), (token2, tag2), ...]. The tags are strings; in
many basic applications, so are the tokens. For instance, the tokens may be words
and the strings their POS, as in
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
(As an aside, this sequence-of-token-tag-pairs convention is pervasive in nltk and
its documentation, but it should arguably be better expressed as named tuples
rather than pairs, so that instead of saying
[token for (token, _tag) in tagged_sequence]
you could say for instance
[x.token for x in tagged_sequence]
The first case fails on non-pairs, but the second exploits duck typing so
that tagged_sequence could be any sequence of user-defined instances, as long as
they have an attribute "token".)
Now, you could well have a richer representation of what a token is at your
disposal. An existing tagger interface (nltk.tag.api.FeaturesetTaggerI) expects
each token as a featureset rather than a string, which is a dictionary that maps
feature names to feature values for each item in the sequence.
A tagged sequence may then look like
[({'word': 'Pierre', 'tag': 'NNP', 'iob': 'B-NP'}, 'NNP'),
({'word': 'Vinken', 'tag': 'NNP', 'iob': 'I-NP'}, 'NNP'),
({'word': ',', 'tag': ',', 'iob': 'O' }, ','),
...
]
There are other possibilities (though with less support in the rest of nltk).
For instance, you could have a named tuple for each token, or a user-defined
class which allows you to add any amount of dynamic calculation to
attribute access (perhaps using #property to offer a consistent interface).
The brill tagger doesn't need to know what view you currently provide
on your tokens. However, it does require you to provide an initial tagger
which can take sequences of tokens-in-your-representation to sequences of
tags. You cannot use the existing taggers in nltk.tag.sequential directly,
since they expect [(word, tag), ...]. But you may still be able to
exploit them. The example below uses this strategy (in MyInitialTagger), and the token-as-featureset-dictionary view.
from __future__ import division, print_function, unicode_literals
import sys
from nltk import tbl, untag
from nltk.tag.brill_trainer import BrillTaggerTrainer
# or:
# from nltk.tag.brill_trainer_orig import BrillTaggerTrainer
# 100 templates and a tiny 500 sentences (11700
# tokens) produce 420000 rules and uses a
# whopping 1.3GB of memory on my system;
# brill_trainer_orig is much slower, but uses 0.43GB
from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags
from nltk.tag import DefaultTagger
def get_templates():
wds10 = [[Word([0])],
[Word([-1])],
[Word([1])],
[Word([-1]), Word([0])],
[Word([0]), Word([1])],
[Word([-1]), Word([1])],
[Word([-2]), Word([-1])],
[Word([1]), Word([2])],
[Word([-1,-2,-3])],
[Word([1,2,3])]]
pos10 = [[POS([0])],
[POS([-1])],
[POS([1])],
[POS([-1]), POS([0])],
[POS([0]), POS([1])],
[POS([-1]), POS([1])],
[POS([-2]), POS([-1])],
[POS([1]), POS([2])],
[POS([-1, -2, -3])],
[POS([1, 2, 3])]]
iobs5 = [[IOB([0])],
[IOB([-1]), IOB([0])],
[IOB([0]), IOB([1])],
[IOB([-2]), IOB([-1])],
[IOB([1]), IOB([2])]]
# the 5 * (10+10) = 100 3-feature templates
# of Ramshaw and Marcus
templates = [tbl.Template(*wdspos+iob)
for wdspos in wds10+pos10 for iob in iobs5]
# Footnote:
# any template-generating functions in new code
# (as opposed to recreating templates from earlier
# experiments like Ramshaw and Marcus) might
# also consider the mass generating Feature.expand()
# and Template.expand(). See the docs, or for
# some examples the original pull request at
# https://github.com/nltk/nltk/pull/549
# ("Feature- and Template-generating factory functions")
return templates
def build_multifeature_corpus():
# The true value of the target fields is unknown in testing,
# and, of course, templates must not refer to it in training.
# But we may wish to keep it for reference (here, truepos).
def tuple2dict_featureset(sent, tagnames=("word", "truepos", "iob")):
return (dict(zip(tagnames, t)) for t in sent)
def tag_tokens(tokens):
return [(t, t["truepos"]) for t in tokens]
# connlltagged_sents :: [[(word,tag,iob)]]
connlltagged_sents = (tree2conlltags(sent)
for sent in treebank_chunk.chunked_sents())
conlltagged_tokenses = (tuple2dict_featureset(sent)
for sent in connlltagged_sents)
conlltagged_sequences = (tag_tokens(sent)
for sent in conlltagged_tokenses)
return conlltagged_sequences
class Word(tbl.Feature):
#staticmethod
def extract_property(tokens, index):
return tokens[index][0]["word"]
class IOB(tbl.Feature):
#staticmethod
def extract_property(tokens, index):
return tokens[index][0]["iob"]
class POS(tbl.Feature):
#staticmethod
def extract_property(tokens, index):
return tokens[index][1]
class MyInitialTagger(DefaultTagger):
def choose_tag(self, tokens, index, history):
tokens_ = [t["word"] for t in tokens]
return super().choose_tag(tokens_, index, history)
def main(argv):
templates = get_templates()
trainon = 100
corpus = list(build_multifeature_corpus())
train, test = corpus[:trainon], corpus[trainon:]
print(train[0], "\n")
initial_tagger = MyInitialTagger('NN')
print(initial_tagger.tag(untag(train[0])), "\n")
trainer = BrillTaggerTrainer(initial_tagger, templates, trace=3)
tagger = trainer.train(train)
taggedtest = tagger.tag_sents([untag(t) for t in test])
print(test[0])
print(initial_tagger.tag(untag(test[0])))
print(taggedtest[0])
print()
tagger.print_template_statistics()
if __name__ == '__main__':
sys.exit(main(sys.argv))
The setup above builds a POS tagger. If you instead wish to target another attribute, say to build an IOB tagger, you need a couple of small changes
so that the target attribute (which you can think of as read-write)
is accessed from the 'tag' position in your corpus [(token, tag), ...]
and any other attributes (which you can think of as read-only)
are accessed from the 'token' position. For instance:
1) construct your corpus [(token,tag), (token,tag), ...] for IOB tagging
def build_multifeature_corpus():
...
def tuple2dict_featureset(sent, tagnames=("word", "pos", "trueiob")):
return (dict(zip(tagnames, t)) for t in sent)
def tag_tokens(tokens):
return [(t, t["trueiob"]) for t in tokens]
...
2) change the initial tagger accordingly
...
initial_tagger = MyInitialTagger('O')
...
3) modify the feature-extracting class definitions
class POS(tbl.Feature):
#staticmethod
def extract_property(tokens, index):
return tokens[index][0]["pos"]
class IOB(tbl.Feature):
#staticmethod
def extract_property(tokens, index):
return tokens[index][1]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

multiple SpaCy doc objects and want to combine them into one object - python

Related

TypeError in torch.argmax() when want to find the tokens with the highest `start` score

Python sklearn TfidfVectorizer: Vectorize documents ahead of query for semantic search

Doc2Vec not providing adequate results in most_similar

Doc2Vec: tuTypeError: '<' not supported between instances of 'str' and 'int'

Training IOB Chunker using nltk.tag.brill_trainer (Transformation-Based Learning)

Categories

Resources