gensim: custom similarity measure - python

Using gensim, I want to calculate the similarity within a list of documents. This library is excellent at handling the amounts of data I have. The documents are all reduced to timestamps, and I have a function time_similarity to compare them. gensim, however, uses cosine similarity.
I am wondering if anyone has attempted this before or has a different solution.

It is possible to do this by inheriting from the interface SimilarityABC. I did not find any documentation for this, but it looks like it has been done before to define Word Mover's Distance similarity. Here is a generic way to do this. You can likely make it more efficient by specializing to the similarity measure you care about.
import numpy
from gensim import interfaces

class CustomSimilarity(interfaces.SimilarityABC):
    def __init__(self, corpus, custom_similarity, num_best=None, chunksize=256):
        self.corpus = corpus
        self.custom_similarity = custom_similarity
        self.num_best = num_best
        self.chunksize = chunksize
        self.normalize = False

    def get_similarities(self, query):
        """
        **Do not use this function directly; use the self[query] syntax instead.**
        """
        if isinstance(query, numpy.ndarray):
            # Convert document indexes to actual documents.
            query = [self.corpus[i] for i in query]
        if not isinstance(query[0], list):
            query = [query]
        n_queries = len(query)
        result = []
        for qidx in range(n_queries):
            qresult = [self.custom_similarity(document, query[qidx])
                       for document in self.corpus]
            qresult = numpy.array(qresult)
            result.append(qresult)
        if len(result) == 1:
            # Only one query.
            result = result[0]
        else:
            result = numpy.array(result)
        return result
To implement a custom similarity:
def overlap_sim(doc1, doc2):
    # similarity defined by the number of common words
    return len(set(doc1) & set(doc2))

corpus = [['cat', 'dog'], ['cat', 'bird'], ['dog']]
cs = CustomSimilarity(corpus, overlap_sim, num_best=2)
print(cs[['bird', 'cat', 'frog']])
This outputs [(1, 2.0), (0, 1.0)].
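For the timestamp use case in the question, the same class can wrap a time-based measure. A minimal sketch (the similarity function here is hypothetical; the question's own time_similarity would drop in the same way):

def time_similarity(doc1, doc2):
    # hypothetical measure: closer timestamps -> higher similarity
    return 1.0 / (1.0 + abs(doc1[0] - doc2[0]))

corpus = [[1000], [1005], [2000]]  # each document reduced to a timestamp
ts = CustomSimilarity(corpus, time_similarity, num_best=2)
print(ts[[1001]])  # two nearest documents as (index, score) pairs, as in the overlap example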

Related

How do I use gensim to vectorize these words in my dataframe so I can perform clustering on them?

I am trying to do a clustering analysis (preferably k-means) of poetry words in a pandas dataframe. First I am trying to vectorize the words using the word2vec feature of the gensim package. However, the vectors just come out as 0s, so my code is failing to translate the words into vectors, and as a result the clustering doesn't work. Here is my code:
import gensim
import numpy as np

# create a gensim model
model = gensim.models.Word2Vec(vector_size=100)

# copy original pandas dataframe with poems
data = poems.copy(deep=True)

# get data ready for kmeans clustering
final_data = []  # empty list
for i, row in data.iterrows():
    poem_vectorized = []
    poem = row['Main_text']
    poem_all_words = poem.split(sep=" ")
    for poem_w in poem_all_words:  # iterate through list of words
        try:
            poem_vectorized.append(list(model.wv[poem_w]))
        except Exception as e:
            pass  # word not in the model's vocabulary
    try:
        poem_vectorized = np.asarray(poem_vectorized)
        poem_vectorized_mean = list(np.mean(poem_vectorized, axis=0))
    except Exception as e:
        poem_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(poem_vectorized_mean)
    except:
        poem_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(poem_vectorized_mean)
    final_data.append(temp_row)

X = np.asarray(final_data)
print(X)
On closer inspection, the problem seems to be with this line:
poem_vectorized.append(list(model.wv[poem_w]))
If I understand it correctly, you want to use an existing model to get the semantic embeddings of the tokens and then cluster the words, right?
The way you set the model up, you are preparing a new model for training but never feed it any training data or train it, so your model doesn't know any words and just always throws a KeyError when calling model.wv[poem_w].
Use gensim.downloader to load an existing model (check out their repository for a list of all available models):
import gensim.downloader as api
import numpy as np
import pandas
poems = pandas.DataFrame({"Main_text": ["This is a sample poem.", "This is another sample poem."]})
model = api.load("glove-wiki-gigaword-100")
Then use it to retrieve the vectors for all words the models knows:
final_data = []
for poem in poems['Main_text']:
    poem_all_words = poem.split()
    poem_vectorized = []
    for poem_w in poem_all_words:
        if poem_w in model:
            poem_vectorized.append(model[poem_w])
    poem_vectorized_mean = np.mean(poem_vectorized, axis=0)
    final_data.append(poem_vectorized_mean)
Or with a list comprehension:
final_data = []
for poem in poems['Main_text']:
    poem_vectorized_mean = np.mean([model[poem_w] for poem_w in poem.split() if poem_w in model], axis=0)
    final_data.append(poem_vectorized_mean)
Both of which will give you:
X = np.asarray(final_data)
print(X)
> [[-3.74696642e-01 3.73661995e-01 4.09943342e-01 -2.07784668e-01
...
-1.85739681e-01 -7.07386672e-01 3.31366658e-01 3.31600010e-01]
[-3.29973340e-01 4.13213342e-01 5.26199996e-01 -2.29261339e-01
...
-1.25366330e-01 -5.87253332e-01 2.80240029e-01 2.56700337e-01]]
Note that attempting to take np.mean() of an empty list will throw an error, so you might want to catch that case if there are poems which are empty or where all words are unknown to the model.
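A minimal guard for that case, reusing the model and dataframe from above, could look like this:

final_data = []
for poem in poems['Main_text']:
    vectors = [model[w] for w in poem.split() if w in model]
    if vectors:
        final_data.append(np.mean(vectors, axis=0))
    else:
        # empty poem, or all words unknown to the model
        final_data.append(np.zeros(model.vector_size))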

Python sklearn TfidfVectorizer: Vectorize documents ahead of query for semantic search

I want to run semantic search using TF-IDF.
This code works, but it is really slow when used on a large corpus of documents:
search_terms = "my query"
documents = ["my","list","of","docs"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform([search_terms] + documents)
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]
It seems quite inefficient:
Every new search query triggers a re-vectorizing of the entire corpus.
I am wondering how I can do the bulk work of vectorizing my corpus ahead of time, saving the result in an "index file", so that when I run a query, the only thing left to do is to vectorize the few words of the query and then calculate similarity.
I tried vectorizing query and documents separately:
vec_docs = vectorizer.fit_transform(documents)
vec_query = vectorizer.fit_transform([search_terms])
cosine_similarities = linear_kernel(vec_query, vec_docs).flatten()
But it gives me this error:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 3 while Y.shape[1] == 260541
How can I run the corpus vectorization ahead of time without knowing what the query will be?
My main goal is to get blazing fast results even with a large corpus of documents (say, a few GB worth of text), even on a low-powered server, by doing the bulk of the data-crunching ahead of time.
TF/IDF vectors are high-dimensional and sparse. The basic data structure that supports that is an inverted index. You can either implement it yourself or use a standard index (e.g., Lucene).
Nevertheless, if you would like to experiment with modern deep-neural-based vector representations, check out the following semantic search demo. It uses a similarity search service that can handle billions of vectors.
(Note, I am a co-author of this demo.)
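To illustrate the inverted-index idea mentioned above (a toy sketch, not production code): the index maps each term to the set of documents containing it, so a query only needs to score documents that share at least one term with it.

from collections import defaultdict

docs = ["my list of docs", "another doc here", "my other doc"]

# build once: term -> ids of documents containing it
index = defaultdict(set)
for doc_id, text in enumerate(docs):
    for term in text.split():
        index[term].add(doc_id)

# at query time, only candidate documents are touched
query_terms = "my doc".split()
candidates = set().union(*(index[t] for t in query_terms if t in index))
print(sorted(candidates))  # ids of documents sharing at least one query term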
You almost have it right.
In this case, you can get away with fitting (and transforming) your documents once and only transforming your search terms. Here is your code, modified accordingly, using the twenty_newsgroups documents (11k) in place of yours. You can run it as a script and interactively verify that you get fast results:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

news = fetch_20newsgroups()

search_terms = "my query"
# documents = ["my", "list", "of", "docs"]
documents = news.data

vectorizer = TfidfVectorizer()
# fit_transform does two things: fits the vectorizer and transforms documents
doc_vectors = vectorizer.fit_transform(documents)
# the vectorizer is already fit; just transform search_terms via vectorizer
search_term_vector = vectorizer.transform([search_terms])
cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()

if __name__ == "__main__":
    while True:
        query_str = input("\n\n\n\nquery string (return to quit): ")
        if not query_str:
            print("bye!")
            break
        search_term_vector = vectorizer.transform([query_str])
        cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()
        best_idx = np.argmax(cosine_similarities)
        best_score = cosine_similarities[best_idx]
        best_doc = documents[best_idx]
        if best_score < 0.1:
            print("no good matches")
        else:
            print(f"Best match ({round(best_score, 4)}):\n\n", best_doc[0:200] + "...")
Example output:
query string (return to quit): protocol
Best match 0.239 (0.014 sec):
From: ethan@cs.columbia.edu (Ethan Solomita)
Subject: Re: X protocol packet type
Article-I.D.: cs.C52I2q.IFJ
Organization: Columbia University Department of Computer Science
Lines: 7
In article <9309...
Note: this algorithm finds the best match(es) in O(n_documents) time at best, compared to Lucene (which powers Elasticsearch) using skip lists that can search in O(log(n_documents)). Production search engines also have quite a bit of tuning to optimize performance. The above could be useful with some tweaking but isn't going to topple Google tomorrow :)
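To get the "index file" the question asks about, a simple option (a sketch using joblib, though pickle works too) is to fit once offline, dump the fitted vectorizer and the document matrix, and reload both at query time so no refitting ever happens:

from joblib import dump, load

# offline: fit once and save
dump(vectorizer, "vectorizer.joblib")
dump(doc_vectors, "doc_vectors.joblib")

# online: load and answer queries without touching the corpus again
vectorizer = load("vectorizer.joblib")
doc_vectors = load("doc_vectors.joblib")
query_vector = vectorizer.transform(["my query"])
scores = linear_kernel(doc_vectors, query_vector).flatten()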

NLTK: How to define the "labeled_featuresets" when creating a ClassifierBasedTagger with nltk?

I am playing around with nltk right now. I am trying to create various classifiers with nltk for named entity recognition, to compare their results. Creating n-gram taggers was easy; however, I have run into some issues creating a ClassifierBasedTagger for the Naive Bayes or Decision Tree classifiers.
My data is in the CoNLL IOB format. After reading it, I convert it into tuples that look like this: ((word, POS-tag), entity)
I have created the following class that creates the Classifiers:
class ClassifierChunker(ChunkParserI):
    def __init__(self, trainSents, tagger, **kwargs):
        if type(tagger) is not nltk.tag.sequential.UnigramTagger and type(tagger) is not nltk.tag.sequential.BigramTagger and type(tagger) is not nltk.tag.sequential.TrigramTagger:
            self.featureDetector = tagger.feature_detector
        self.tagger = tagger

    def parse(self, sentence):
        chunks = self.tagger.tag(sentence)
        iobTriblets = [(word, pos, entity) for ((word, pos), entity) in chunks]
        return conlltags2tree(iobTriblets)

    def evaluate2(self, testSents):
        return self.evaluate([conlltags2tree([(word, pos, entity) for (word, pos), entity in iobs]) for iobs in testSents])
That's how I call it:
#naiveBayers
naiveBayers = NaiveBayesClassifier.train
naiveBayersTagger = ClassifierBasedTagger(train=completeTaggedSentencesTrain, feature_detector=features, classifier_builder=naiveBayers)
nerChunkerNaiveBayers = ClassifierChunker(completeTaggedSentencesTrain, naiveBayersTagger)
evalNaiveBayers = nerChunkerNaiveBayers.evaluate2(completeTaggedSentencesTest)
print(evalNaiveBayers)
The problem I have is with the first line of code (naiveBayers = NaiveBayesClassifier.train). I know I am supposed to pass the train function a labeled feature set, but I am not exactly sure what that means.
In the documentation it says the following:
:param labeled_featuresets: A list of classified featuresets,
i.e., a list of tuples (featureset, label).
Would the featureset be the word and the label the entity?
After encountering this problem I did some research and found nltk-trainer. There the classifier_builder is created inside the args.py file, more specifically in the inner function "trainf" of the function "make_classifier_builder".
However, I have no idea where the variable "train_feats" is coming from. Maybe it has something to do with my limited understanding of inner functions; I can't find it being called anywhere.
I would really appreciate it if someone could point me in the right direction.
Edit:
I have just read in the NLTK 3 Cookbook that the feature_detector function returns a feature set (p. 143). So am I supposed to use that function in some way?
My current feature detector looks like the following and is taken from that book:
def prev_next_pos_iob(tokens, index, history):
    word, pos = tokens[index]
    if index == 0:
        prevword, prevpos, previob = ('<START>',) * 3
    else:
        prevword, prevpos = tokens[index - 1]
        previob = history[index - 1]
    if index == len(tokens) - 1:
        nextword, nextpos = ('<END>',) * 2
    else:
        nextword, nextpos = tokens[index + 1]
    feats = {
        'word': word,
        'pos': pos,
        'nextword': nextword,
        'nextpos': nextpos,
        'prevword': prevword,
        'prevpos': prevpos,
        'previob': previob
    }
    return feats
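For reference, a detector with this signature is called with the (word, pos) tokens of a sentence, the index of the current token, and the history of IOB tags already assigned; a small illustrative call:

tokens = [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD')]
history = ['B-NP']  # IOB tags assigned to earlier tokens so far
print(prev_next_pos_iob(tokens, 1, history))
# {'word': 'dog', 'pos': 'NN', 'nextword': 'barked', 'nextpos': 'VBD',
#  'prevword': 'The', 'prevpos': 'DT', 'previob': 'B-NP'}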

Creating features function for further classification in python

I have read a description of how to apply random forest regression here. In this example the authors use the following code to create the features:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer="word", max_features=5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
I am thinking of combining several possibilities as features and turning them on and off, but I don't know how to do that.
What I have so far is a class where I can turn features on and off and see whether each one helps (for example, all unigrams and the 20 most frequent unigrams; later it could be the 10 most frequent adjectives, tf-idf, etc.). But for now I don't understand how to combine them together.
The code looks like this, and it's the function part where I am lost (the kind of function I have would replicate what they do in the tutorial, but it doesn't seem really helpful the way I do it):
class FeatureGen:  # for example, feat = FeatureGen(unigrams = False) creates a feature set without the turned-off feature
    def __init__(self, unigrams = True, unigrams_freq = True):
        self.unigrams = unigrams
        self.unigrams_freq = unigrams_freq

    def get_features(self, input):
        vectorizer = CountVectorizer(analyzer="word", max_features=5000)
        tokens = input["token"]
        if self.unigrams:
            train_data_features = vectorizer.fit_transform(tokens)
        return train_data_features
What should I do to add one more feature possibility? Like contains 10 most frequent words.
if self.unigrams:
    train_data_features = vectorizer.fit_transform(tokens)
if self.unigrams_freq:
    # something else
return features  # and this should be a combination somehow
Looks like you need np.hstack.
However, you need each feature array to have one row per training case.
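A sketch of what that could look like inside get_features (the second feature block here is illustrative; note that CountVectorizer returns a sparse matrix, so each block is densified with .toarray() before stacking):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

class FeatureGen:
    def __init__(self, unigrams=True, unigrams_freq=True):
        self.unigrams = unigrams
        self.unigrams_freq = unigrams_freq

    def get_features(self, input):
        tokens = input["token"]
        blocks = []
        if self.unigrams:
            vectorizer = CountVectorizer(analyzer="word", max_features=5000)
            blocks.append(vectorizer.fit_transform(tokens).toarray())
        if self.unigrams_freq:
            # illustrative second block: counts over the 10 most frequent words only
            top10 = CountVectorizer(analyzer="word", max_features=10)
            blocks.append(top10.fit_transform(tokens).toarray())
        # every block has one row per training case, so they stack column-wise
        return np.hstack(blocks)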

Training IOB Chunker using nltk.tag.brill_trainer (Transformation-Based Learning)

I'm trying to train a specific chunker (let's say a noun chunker for simplicity) by using NLTK's brill module. I'd like to use three features, i.e. word, POS tag, and IOB tag.
Ramshaw and Marcus (1995: 7) show 100 templates generated from combinations of those three features, for example,
W0, P0, T0 # current word, pos tag, iob tag
W-1, P0, T-1 # prev word, pos tag, prev iob tag
...
I want to incorporate them into nltk.tbl.feature, but there are only two kinds of feature objects, i.e. brill.Word and brill.Pos. Limited by the design, I could only put the word and POS features together as (word, pos), and thus used ((word, pos), iob) as the features for training. For example,
from nltk.tbl import Template
from nltk.tag import brill, brill_trainer, untag
from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags, conlltags2tree

# Codes from (Perkins, 2013)
def train_brill_tagger(initial_tagger, train_sents, **kwargs):
    templates = [
        brill.Template(brill.Word([0])),
        brill.Template(brill.Pos([-1])),
        brill.Template(brill.Word([-1])),
        brill.Template(brill.Word([0]), brill.Pos([-1])),
    ]
    trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates, trace=3)
    return trainer.train(train_sents, **kwargs)

# generating ((word, pos), iob) pairs as features.
def chunk_trees2train_chunks(chunk_sents):
    tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
    return [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]
>>> from nltk.tag import DefaultTagger
>>> tagger = DefaultTagger('NN')
>>> train = treebank_chunk.chunked_sents()[:2]
>>> t = chunk_trees2train_chunks(train)
>>> bt = train_brill_tagger(tagger, t)
TBL train (fast) (seqs: 2; tokens: 31; tpls: 4; min score: 2; min acc: None)
Finding initial useful rules...
Found 79 useful rules.
B |
S F r O | Score = Fixed - Broken
c i o t | R Fixed = num tags changed incorrect -> correct
o x k h | u Broken = num tags changed correct -> incorrect
r e e e | l Other = num tags changed incorrect -> incorrect
e d n r | e
------------------+-------------------------------------------------------
12 12 0 17 | NN->I-NP if Pos:NN#[-1]
3 3 0 0 | I-NP->O if Word:(',', ',')#[0]
2 2 0 0 | I-NP->B-NP if Word:('the', 'DT')#[0]
2 2 0 0 | I-NP->O if Word:('.', '.')#[0]
As shown above, (word, pos) is treated as a single feature. This does not perfectly capture the three features (word, POS tag, IOB tag) separately.
Are there other ways to implement the word, POS, and IOB features separately in nltk.tbl.feature?
If it is impossible in NLTK, are there other implementations of them in python? I was only able to find C++ and Java implementations on the internet.
The nltk3 brill trainer api (I wrote it) does handle training on sequences of tokens described with multidimensional
features, as your data is an example of. However, the practical limits may be severe. The number of possible templates in multidimensional learning
increases drastically, and the current nltk implementation of the brill trainer trades memory
for speed, similar to Ramshaw and Marcus 1994, "Exploring the statistical derivation of transformation-rule sequences...".
Memory consumption may be HUGE and
it is very easy to give the system more data and/or templates than
it can handle. A useful strategy is to rank
templates according to how often they produce good rules (see
print_template_statistics() in the example below).
Usually, you can discard the lowest-scoring fraction (say 50-90%)
with little or no loss in performance and a major decrease in training time.
Another or additional possibility is to use the nltk
implementation of Brill's original algorithm, which has very different memory-speed tradeoffs; it does no indexing and so will use much less memory. It uses some optimizations and is actually rather quick in finding the very best rules, but is generally extremely slow towards end of training when there are many competing, low-scoring candidates. Sometimes you don't need those, anyway. For some reason this implementation seems to have been omitted from newer nltks, but here is the source (I just tested it) http://www.nltk.org/_modules/nltk/tag/brill_trainer_orig.html.
There are other algorithms with other tradeoffs, and
in particular the fast memory-efficient indexing algorithms of Florian and Ngai 2000
(http://www.aclweb.org/anthology/N/N01/N01-1006.pdf) and
probabilistic rule sampling of Samuel 1998
(https://www.aaai.org/Papers/FLAIRS/1998/FLAIRS98-045.pdf) would be useful additions. Also, as you noticed, the documentation is not complete and too much focused on part-of-speech tagging, and it is not clear how to generalize from it. Fixing the docs is (also) on the todo list.
However, the interest for generalized (non-POS-tagging) tbl in nltk has been rather limited (the totally unsuited api of nltk2 was untouched for 10 years), so don't hold your breath. If you get impatient, you may wish to check out more dedicated alternatives,
in particular mutbl and fntbl (google them, I only have reputation for two links).
Anyway, here is a quick sketch for nltk:
First, a hardcoded convention in nltk is that tagged sequences ('tags' meaning any label
you would like to assign to your data, not necessarily part-of-speech) are represented
as sequences of pairs, [(token1, tag1), (token2, tag2), ...]. The tags are strings; in
many basic applications, so are the tokens. For instance, the tokens may be words
and the strings their POS, as in
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
(As an aside, this sequence-of-token-tag-pairs convention is pervasive in nltk and
its documentation, but it should arguably be better expressed as named tuples
rather than pairs, so that instead of saying
[token for (token, _tag) in tagged_sequence]
you could say for instance
[x.token for x in tagged_sequence]
The first case fails on non-pairs, but the second exploits duck typing so
that tagged_sequence could be any sequence of user-defined instances, as long as
they have an attribute "token".)
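For concreteness, a minimal sketch of that named-tuple idea:

from collections import namedtuple

Tagged = namedtuple("Tagged", ["token", "tag"])
tagged_sequence = [Tagged("And", "CC"), Tagged("now", "RB")]
print([x.token for x in tagged_sequence])  # ['And', 'now']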
Now, you could well have a richer representation of what a token is at your
disposal. An existing tagger interface (nltk.tag.api.FeaturesetTaggerI) expects
each token as a featureset rather than a string, which is a dictionary that maps
feature names to feature values for each item in the sequence.
A tagged sequence may then look like
[({'word': 'Pierre', 'tag': 'NNP', 'iob': 'B-NP'}, 'NNP'),
({'word': 'Vinken', 'tag': 'NNP', 'iob': 'I-NP'}, 'NNP'),
({'word': ',', 'tag': ',', 'iob': 'O' }, ','),
...
]
There are other possibilities (though with less support in the rest of nltk).
For instance, you could have a named tuple for each token, or a user-defined
class which allows you to add any amount of dynamic calculation to
attribute access (perhaps using @property to offer a consistent interface).
The brill tagger doesn't need to know what view you currently provide
on your tokens. However, it does require you to provide an initial tagger
which can take sequences of tokens-in-your-representation to sequences of
tags. You cannot use the existing taggers in nltk.tag.sequential directly,
since they expect [(word, tag), ...]. But you may still be able to
exploit them. The example below uses this strategy (in MyInitialTagger), and the token-as-featureset-dictionary view.
from __future__ import division, print_function, unicode_literals

import sys

from nltk import tbl, untag
from nltk.tag.brill_trainer import BrillTaggerTrainer
# or:
# from nltk.tag.brill_trainer_orig import BrillTaggerTrainer
# 100 templates and a tiny 500 sentences (11700
# tokens) produce 420000 rules and uses a
# whopping 1.3GB of memory on my system;
# brill_trainer_orig is much slower, but uses 0.43GB
from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags
from nltk.tag import DefaultTagger


def get_templates():
    wds10 = [[Word([0])],
             [Word([-1])],
             [Word([1])],
             [Word([-1]), Word([0])],
             [Word([0]), Word([1])],
             [Word([-1]), Word([1])],
             [Word([-2]), Word([-1])],
             [Word([1]), Word([2])],
             [Word([-1, -2, -3])],
             [Word([1, 2, 3])]]

    pos10 = [[POS([0])],
             [POS([-1])],
             [POS([1])],
             [POS([-1]), POS([0])],
             [POS([0]), POS([1])],
             [POS([-1]), POS([1])],
             [POS([-2]), POS([-1])],
             [POS([1]), POS([2])],
             [POS([-1, -2, -3])],
             [POS([1, 2, 3])]]

    iobs5 = [[IOB([0])],
             [IOB([-1]), IOB([0])],
             [IOB([0]), IOB([1])],
             [IOB([-2]), IOB([-1])],
             [IOB([1]), IOB([2])]]

    # the 5 * (10+10) = 100 3-feature templates
    # of Ramshaw and Marcus
    templates = [tbl.Template(*wdspos+iob)
                 for wdspos in wds10+pos10 for iob in iobs5]

    # Footnote:
    # any template-generating functions in new code
    # (as opposed to recreating templates from earlier
    # experiments like Ramshaw and Marcus) might
    # also consider the mass generating Feature.expand()
    # and Template.expand(). See the docs, or for
    # some examples the original pull request at
    # https://github.com/nltk/nltk/pull/549
    # ("Feature- and Template-generating factory functions")

    return templates


def build_multifeature_corpus():
    # The true value of the target fields is unknown in testing,
    # and, of course, templates must not refer to it in training.
    # But we may wish to keep it for reference (here, truepos).

    def tuple2dict_featureset(sent, tagnames=("word", "truepos", "iob")):
        return (dict(zip(tagnames, t)) for t in sent)

    def tag_tokens(tokens):
        return [(t, t["truepos"]) for t in tokens]

    # connlltagged_sents :: [[(word,tag,iob)]]
    connlltagged_sents = (tree2conlltags(sent)
                          for sent in treebank_chunk.chunked_sents())
    conlltagged_tokenses = (tuple2dict_featureset(sent)
                            for sent in connlltagged_sents)
    conlltagged_sequences = (tag_tokens(sent)
                             for sent in conlltagged_tokenses)
    return conlltagged_sequences


class Word(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["word"]


class IOB(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["iob"]


class POS(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][1]


class MyInitialTagger(DefaultTagger):
    def choose_tag(self, tokens, index, history):
        tokens_ = [t["word"] for t in tokens]
        return super().choose_tag(tokens_, index, history)


def main(argv):
    templates = get_templates()
    trainon = 100

    corpus = list(build_multifeature_corpus())
    train, test = corpus[:trainon], corpus[trainon:]

    print(train[0], "\n")

    initial_tagger = MyInitialTagger('NN')
    print(initial_tagger.tag(untag(train[0])), "\n")

    trainer = BrillTaggerTrainer(initial_tagger, templates, trace=3)
    tagger = trainer.train(train)

    taggedtest = tagger.tag_sents([untag(t) for t in test])
    print(test[0])
    print(initial_tagger.tag(untag(test[0])))
    print(taggedtest[0])
    print()

    tagger.print_template_statistics()

if __name__ == '__main__':
    sys.exit(main(sys.argv))
The setup above builds a POS tagger. If you instead wish to target another attribute, say to build an IOB tagger, you need a couple of small changes
so that the target attribute (which you can think of as read-write)
is accessed from the 'tag' position in your corpus [(token, tag), ...]
and any other attributes (which you can think of as read-only)
are accessed from the 'token' position. For instance:
1) construct your corpus [(token,tag), (token,tag), ...] for IOB tagging
def build_multifeature_corpus():
    ...
    def tuple2dict_featureset(sent, tagnames=("word", "pos", "trueiob")):
        return (dict(zip(tagnames, t)) for t in sent)

    def tag_tokens(tokens):
        return [(t, t["trueiob"]) for t in tokens]
    ...
2) change the initial tagger accordingly
...
initial_tagger = MyInitialTagger('O')
...
3) modify the feature-extracting class definitions
class POS(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["pos"]

class IOB(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][1]
