I'm having trouble building at least one functional machine learning model; the examples I've found all over the web are either off topic, or good but incomplete (missing dataset, explanations, ...).
The closest example related to my problem is this.
I'm trying to create a model based on accelerometer and gyroscope sensors, each with its own 3 axes. For example, if I lift the sensor parallel to gravity and then return it to its initial position, I should get a table like this.
Example
Now, this whole table corresponds to one movement, which I call "Fade_away", and the duration of this movement is variable.
I have only two main questions:
In which format should I save my dataset? I don't think a plain array can arrange this kind of data.
How can I implement a simple model with at least one hidden layer?
To make it easier, let's say I have 3 outputs: "Fade_away", "Punch" and "Rainbow".
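For reference, here is a minimal sketch of the kind of model I have in mind (assuming Keras, recordings padded to a hypothetical fixed length MAX_LEN, and dummy data in place of my real sensor readings):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

MAX_LEN = 100       # hypothetical fixed number of time steps after padding
NUM_AXES = 6        # 3 accelerometer axes + 3 gyroscope axes
NUM_CLASSES = 3     # "Fade_away", "Punch", "Rainbow"

# X holds the padded recordings, y the integer class labels (dummy values here)
X = np.zeros((10, MAX_LEN, NUM_AXES), dtype=np.float32)
y = np.zeros(10, dtype=np.int32)

model = Sequential([
    Flatten(input_shape=(MAX_LEN, NUM_AXES)),   # flatten time steps x axes
    Dense(32, activation="relu"),                # the single hidden layer
    Dense(NUM_CLASSES, activation="softmax"),    # one probability per movement
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2)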
I have a "corpus" built from an item-item graph, which means each sentence is a graph walk path and each word is an item. I want to train a word2vec model upon the corpus to obtain items' embedding vectors. The graph is updated everyday so the word2vec model is trained in an increased way (using Word2Vec.save() and Word2Vec.load()) to keep updating the items' vectors.
Unlike words, the items in my corpus have their lifetime and there will be new items added in everyday. In order to prevent the constant growth of the model size, I need to drop items that reached their lifetime while keep the model trainable. I've read the similar question
here, but this question's answer doesn't related to increased-training and is based on KeyedVectors. I come up with the below code, but I'm not sure if it is correct and proper:
from gensim.models import Word2Vec
import numpy as np
texts = [["a", "b", "c"], ["a", "h", "b"]]
m = Word2Vec(texts, size=5, window=5, min_count=1, workers=1)
print(m.wv.index2word)
print(m.wv.vectors)
# drop old words
wordsToDrop = ["b", "c"]
for w in wordsToDrop:
    i = m.wv.index2word.index(w)
    m.wv.index2word.pop(i)
    m.wv.vectors = np.delete(m.wv.vectors, i, axis=0)
    del m.wv.vocab[w]
print(m.wv.index2word)
print(m.wv.vectors)
m.save("m.model")
del m
# incremental training
new = [["a", "e", "n"], ["r", "s"]]
m = Word2Vec.load("m.model")
m.build_vocab(new, update=True)
m.train(new, total_examples=m.corpus_count, epochs=2)
print(m.wv.index2word)
print(m.wv.vectors)
After the deletion and the incremental training, are m.wv.index2word and m.wv.vectors still element-wise corresponding? Are there any side effects of the above code? If my way is not good, could someone give me an example showing how to drop the old "words" properly while keeping the model trainable?
There's no official support for removing words from a Gensim Word2Vec model, once they've ever "made the cut" for inclusion.
Even the ability to add words isn't on a great footing, as the feature isn't based on any proven/published method of updating a Word2Vec model, and glosses over difficult tradeoffs in how update-batches affect the model, via choice of learning-rate or whether the batches fully represent the existing vocabulary. The safest course is to regularly re-train the model from scratch, with a full corpus with sufficient examples of all relevant words.
So, my main suggestion would be to regularly replace your model with a new one trained with all still-relevant data. That would ensure it's no longer wasting model state on obsolete terms, and that all still-live terms have received coequal, interleaved training.
After such a reset, word-vectors won't be comparable to word-vectors from a prior 'model era'. (The same word, even if its tangible meaning hasn't changed, could wind up in an arbitrarily different place - but the relative relationships with other vectors should remain as good or better.) But that same sort of drift-out-of-comparability is also happening with any set of small-batch updates that don't 'touch' every existing word equally, just at some unquantifiable rate.
OTOH, if you think you need to stay with such incremental updates, even knowing the caveats, it's plausible that you could patch-up the model structures to retain as much as is sensible from the old model & continue training.
Your code so far is a reasonable start, but it's missing a few important considerations for proper functionality:
First, because deleting earlier words changes the index location of later words, you'd need to update the vocab[word].index values for every surviving word to match the new index2word ordering. For example, after doing all deletions, you might do:
for i, word in enumerate(m.wv.index2word):
    m.wv.vocab[word].index = i
Second, because in your (default negative-sampling) Word2Vec model there is also another array of per-word weights related to the model's output layer, that array should be updated in sync, so that the right output values are being checked per word. Roughly, whenever you delete a row from m.wv.vectors, you should delete the same row from m.trainables.syn1neg.
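For instance, inside the deletion loop from your question (a sketch; attribute path per gensim 3.x):
    # mirror the deletion of row i from the word-vectors in the output-layer weights
    m.trainables.syn1neg = np.delete(m.trainables.syn1neg, i, axis=0)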
Third, because the surviving vocabulary has different relative word frequencies, both the negative-sampling and downsampling (controlled by the sample parameter) functions should work off different pre-calculated structures to assist their choices. For the cumulative-distribution table used by negative-sampling, this is pretty easy:
m.make_cum_table(m.wv)
For the downsampling, you'd want to update the .sample_int values similar to the logic you can view around the code at https://github.com/RaRe-Technologies/gensim/blob/3.8.3/gensim/models/word2vec.py#L1534. (But, looking at that code now, I think it may be buggy in that it's updating all words with just the frequency info in the new dict, so probably fouling the usual downsampling of truly-frequent words, and possibly erroneously downsampling words that are only frequent in the new update.)
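As a rough, hedged sketch of what recalculating the .sample_int values might look like (mirroring the standard subsampling formula, with attribute paths assumed from gensim 3.x; not an official API):
from math import sqrt

sample = m.vocabulary.sample                      # downsampling threshold (assumed attribute path)
counts = {w: m.wv.vocab[w].count for w in m.wv.index2word}
retain_total = sum(counts.values())
if sample:  # sample == 0 would mean no downsampling at all
    threshold_count = sample * retain_total if sample < 1.0 else int(sample * (3 + sqrt(5)) / 2)
    for w, v in counts.items():
        word_probability = min(1.0, (sqrt(v / threshold_count) + 1) * (threshold_count / v))
        m.wv.vocab[w].sample_int = int(round(word_probability * 2**32))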
If those internal structures are updated properly in sync with your existing actions, the model is probably in a consistent state for further training. (But note: these structures change a lot in the forthcoming gensim-4.0.0 release, so any custom tampering will need to be updated when upgrading then.)
One other efficiency note: the np.delete() operation will create a new array, the full size of the surviving array, and copy the old values over, each time it is called. So using it to remove many rows, one at a time, from a very-large original array is likely to require a lot of redundant allocation/copying/garbage-collection. You may be able to call it once, at the end, with a list of all indexes to remove.
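For example, a hedged sketch (attribute names assumed from gensim 3.x) that performs all the deletions in one pass, including the matching output-layer rows and the index fix-up:
drop = set(wordsToDrop)
rows = sorted(m.wv.vocab[w].index for w in drop)

# delete all obsolete rows at once, from both arrays, with one allocation each
m.wv.vectors = np.delete(m.wv.vectors, rows, axis=0)
m.trainables.syn1neg = np.delete(m.trainables.syn1neg, rows, axis=0)

# drop the words from the lookup structures, then re-number the survivors
m.wv.index2word = [w for w in m.wv.index2word if w not in drop]
for w in drop:
    del m.wv.vocab[w]
for i, word in enumerate(m.wv.index2word):
    m.wv.vocab[word].index = i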
But really: the simpler and better-grounded approach, which may also yield significantly better continually-comparable vectors, would be to retrain with all current data whenever possible, or whenever a large amount of change has accumulated.
I'm working on training a sequential labeling model in Python Flair. My raw text data contains concept phrases that I want the model to identify; in some cases a concept is represented by a set of tokens that are not contiguous, with other words in between. An example is "potassium and magnesium replacement", where "potassium replacement" is one concept represented by discontinuous tokens, and "magnesium replacement" is another concept which is contiguous yet overlaps the first.
I trained another Flair model where all concepts could be represented by a single token, and building the corpus CoNLL files for that data was pretty straightforward. In this case, the discontinuous and overlapping concepts raise 3 questions:
Does a Flair sequential labeling model recognize a multi-token concept such as "magnesium replacement" as a single concept if I mark it appropriately in the CoNLL file as:
magnesium B-CONC1
replacement I-CONC1
Does it recognize a discontinuous concept such as "potassium replacement" in the phrase above:
potassium B-CONC2
and O
magnesium O
replacement I-CONC2
How can I represent overlapping concepts in a CoNLL file? Is there some alternative way of representing the corpus, with raw text and a list of start/end indices?
PS It must be pretty clear from the context, but by the word "concept" I mean a single- or multi-token tag/term that I'm trying to train the model to identify.
I'd appreciate any advice or information.
Flair does not support discontinuous and overlapping annotations.
See more at https://github.com/zalandoresearch/flair/issues/824#issuecomment-504322361
I read somewhere around here that running multiple TensorFlow graphs in a single process is considered bad practice. Therefore, I now have a single graph which consists of multiple separate "sub-graphs" of the same structure. Their purpose is to generate specific models that describe the production tolerances of multiple sensors of the same type. The tolerances are different for each sensor.
I'm trying to use TF to optimize a loss function in order to come up with a numerical description (i.e. a tensor) of that production tolerance for each sensor separately.
In order to achieve that and avoid having to deal with multiple graphs (i.e. avoid bad practice), I built a graph that contains a distinct sub-graph for each sensor.
The problem is that I only get data from a single sensor at a time, so I cannot build a feed_dict that fills the placeholders of all sub-graphs with numbers (all zeros wouldn't make sense).
TF then complains about missing values for certain placeholders, namely those of the other sensors for which I don't have data yet. So basically I would like to evaluate one sub-graph without feeding the other sub-graphs.
Is that at all possible, and if yes, what do I have to do to hand an incomplete feed_dict to the graph?
If it's not possible to train only parts of a graph, even when they have no connection to other parts, what's the royal road to creating models with the same structure but different weights that can be trained separately without using multiple graphs?
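To make the setup concrete, here is a rough sketch (TF 1.x style, with hypothetical names and a made-up loss) of what I mean by separate sub-graphs and an incomplete feed_dict:
import numpy as np
import tensorflow as tf  # assuming TF 1.x graph/session semantics

def build_subgraph(name):
    # one sub-graph per sensor: same structure, separate placeholder and variables
    with tf.variable_scope(name):
        x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
        tolerance = tf.get_variable("tolerance", shape=[3])
        loss = tf.reduce_mean(tf.squared_difference(x, tolerance))
        train_op = tf.train.AdamOptimizer(0.01).minimize(loss, var_list=[tolerance])
    return x, loss, train_op

x_a, loss_a, train_a = build_subgraph("sensor_a")
x_b, loss_b, train_b = build_subgraph("sensor_b")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    data_a = np.random.rand(32, 3)
    # this is the kind of call I'd like to make while I only have sensor_a's data,
    # i.e. without providing anything for x_b
    sess.run(train_a, feed_dict={x_a: data_a})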
Very briefly, two or three basic questions about the minimize_nested_blockmodel_dl function in the graph-tool library. Is there a way to figure out which vertex falls into which block? In other words, to extract from each block a list containing the labels of its vertices.
The hierarchical visualization is rather difficult to understand for amateurs in network theory. For example, are the squares with directed edges that are drawn meant to indicate the main direction of the underlying edges between the two blocks under consideration? The blocks are nicely shown using different colors, but on a very conceptual level, which types of patterns or edge/vertex properties lie behind the assignment of vertices to blocks? In other words, when two vertices are in the same block, what can I say about their common properties?
Regarding your first question, it is fairly straightforward: The minimize_nested_blockmodel_dl() function returns a NestedBlockState object:
from graph_tool.all import *  # the usual graph-tool import is assumed here

g = collection.data["football"]
state = minimize_nested_blockmodel_dl(g)
You can query the group membership of the nodes by inspecting the first level of the hierarchy:
lstate = state.levels[0]
This is a BlockState object, from which we get the group memberships via the get_blocks() method:
b = lstate.get_blocks()
print(b[30]) # prints the group membership of node 30
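If you want, for each block, the list of vertices it contains, one way (a sketch using only the property map above; swap in a vertex label property if your graph has one) is:
from collections import defaultdict

members = defaultdict(list)
for v in g.vertices():
    members[int(b[v])].append(int(v))   # vertex index; use a label property here if available

for block, vertices in members.items():
    print(block, vertices)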
Regarding your second question, the stochastic block model assumes that nodes that belong to the same group have the same probability of connecting to the rest of the network. Hence, nodes that get classified in the same group by the function above have similar connectivity patterns. For example, if we look at the fit for the football network:
state.draw(output="football.png")
We see that nodes that belong to the same group tend to have more connections to other nodes of the same group --- a typical example of community structure. However, this is just one of the many possibilities that can be uncovered by the stochastic block model. Other topological patterns include core-periphery organization, bipartiteness, etc.
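If it helps to see this quantitatively, a short sketch is to inspect the matrix of edge counts between blocks, which the lowest-level BlockState exposes:
e = lstate.get_matrix()   # sparse matrix of edge counts between groups
print(e.todense())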