I am using a word embeddings model (FastText via the Gensim library) to expand the terms of a search.
So, basically, if the user writes "operating system" my goal is to expand that term with very similar terms like "os", "windows", "ubuntu", "software" and so on.
The model works very well, but now the time has come to improve it with "external information". By "external information" I mean OOV (out-of-vocabulary) terms OR terms that do not have good context.
Following the example I wrote above, when the user writes "operating system" I would like to expand the query with the "general" terms:
Terms built in the FastText model:
windows
ubuntu
software
AND
terms that represent organizations/companies, like "Microsoft" and "Apple", so the complete query will be:
term: operating system
query: operating system, os, software, windows, ios, Microsoft, Apple
My problem is that I DO NOT have companies inside the corpus OR, if they are present, I do not have enough context to "link" Microsoft to "operating system".
For example, if I extract a piece of the corpus I can read "... i have started working at Microsoft in November 2000 with my friend John ..." so, as you can see, I cannot contextualize the word "Microsoft" because I do not have good context.
A small recap:
I have a corpus where the companies (terms) have poor context
I have a big database with companies and the description of what they do.
What I need to do:
I would like to include the companies in my FastText model and "manually" set their word context/cloud of related terms.
Ideas?
There is no easy way to do this. The FastText algorithm uses character-level information, so it can infer embeddings for unseen words: as the FastText paper describes, each word is represented by the sum of the vectors of its character n-grams.
However, this makes sense only in the case of words where you can infer what they mean from knowing the parts. E.g., if you had a reliable embedding for "walk", but not for "walking" and there were plenty of words ending with "ing", FastText would be able to infer the embedding. But this obviously cannot work with words like "Microsoft".
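To see the mechanism concretely, here is a small Gensim sketch with a toy corpus and made-up hyperparameters; a real corpus is needed for meaningful vectors:

from gensim.models import FastText

# Tiny toy corpus, only to illustrate the mechanism.
sentences = [
    ["i", "walk", "to", "work", "every", "day"],
    ["she", "walked", "home", "after", "work"],
    ["they", "were", "walking", "in", "the", "park"],
]
model = FastText(sentences, vector_size=32, min_count=1, epochs=50)

# Both lookups succeed, because FastText builds OOV vectors from character n-grams,
# but only the first is grounded in shared subwords ("walk", "walked", "walking").
print(model.wv["walks"][:5])       # OOV, gets a plausible subword-based vector
print(model.wv["microsoft"][:5])   # OOV, gets a vector too, but an essentially arbitrary one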
The best thing you can do is train your embeddings on data that contains the words you want the model to work with, from a genre as similar to yours as possible. If your text is in English, it should not be too difficult.
These kinds of models need numerous, varied usage examples to place a token in a relatively good place, at meaningful distances/directions from other related tokens. If you don't have such examples, or your examples are few/poor, there's little way the algorithm can help.
If you somehow know, a priori, that 'microsoft' should appear in some particular vector coordinates, then sure, you could patch the model to include that word->vector mapping. (Though, such model classes often don't include convenient methods for such incremental additions, because it's expected words are trained in bulk from corpuses, not dictated individually.)
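For what it's worth, Gensim 4.x does expose a low-level add_vector on KeyedVectors. Here is a sketch of such a patch, using hypothetical anchor words to fabricate a vector; note this does not update FastText's subword arrays, so treat it as a hack:

import numpy as np

# `model` is an already-trained Gensim model (Word2Vec/FastText).
# Hypothetical choice: set the new word's vector to the mean of hand-picked anchors.
anchors = ["windows", "software", "company"]
new_vec = np.mean([model.wv[w] for w in anchors], axis=0)
model.wv.add_vector("microsoft", new_vec)   # Gensim 4.x API
print(model.wv.most_similar("microsoft", topn=5))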
But if you don't have example text for some range of tokens, like company names, you probably don't have independent ideas of where those tokens should be, either.
Really, you need to find adequate training data. And then, assuming you want the vectors for these new terms to be in the "same space" and comparable to your existing word-vectors, combine that with your prior data, and train all the data together into one combined model. (And further, for an algorithm like FastText to synthesize reasonable guess-vectors for never-before-seen OOV words, it needs lots of examples of words which have overlapping meanings and overlapping character-n-gram fragments.)
Expanding your corpus to have better training data for, say, 100 target organization names might be as simple as scraping sentences/paragraphs including those names from available sources, like Wikipedia or the web.
By gathering dozens (or even better, hundreds or thousands) of independent examples of the organization names in real language contexts, and because those contexts include many mutually-shared other words, or names of yet other related organizations, you'd be able to induce reasonable vectors for those terms, and related terms.
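As a rough sketch of that expand-and-retrain loop, assuming the third-party wikipedia package and Gensim are installed, and that the hypothetical names below resolve to real, unambiguous pages:

import re
import wikipedia                      # pip install wikipedia
from gensim.models import FastText
from gensim.utils import simple_preprocess

org_names = ["Microsoft", "Apple Inc."]          # hypothetical target list
extra_sentences = []
for name in org_names:
    content = wikipedia.page(name, auto_suggest=False).content
    for sent in re.split(r"(?<=[.!?])\s+", content):
        extra_sentences.append(simple_preprocess(sent))

# existing_sentences: your original tokenized corpus (placeholder here)
existing_sentences = [["the", "operating", "system", "crashed", "again"]]
all_sentences = existing_sentences + extra_sentences

# One combined training run, so old and new vectors share the same space.
model = FastText(all_sentences, vector_size=100, window=5, min_count=2, epochs=10)
print(model.wv.most_similar("microsoft", topn=5))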
Related
In my spare time, I am transcribing a very old, rare book written in Romanian (in fact, it is the only remaining copy, to my knowledge). It was written over a hundred years ago, well before any computers existed. As such, no digital copies exist, and I am manually transcribing and digitizing it.
The book is thousands of pages long, and it is surprisingly time-consuming (for me, at least) to add diacritic and accent marks (ă/â/î/ş/ţ) to every single word as I type. If I omit the marks and just type the bare letters (e.g. a instead of ă/â), I am able to type more than twice as fast, which is a huge benefit. Currently I am typing everything directly into a .tex file to apply special formatting for the pages and illustrations.
However, I know that eventually I will have to add all these marks back into the text, and it seems tedious/unnecessary to do all that manually, since I already have all the letters. I'm looking for some way to automatically/semi-automatically ADD diacritic/accent marks to a large body of text (not remove - I see plenty of questions asking how to remove the marks on SO).
I tried searching for large corpora of Romanian words (this and this were the most promising two), but everything I found fell short, missing at least a few words on any random sample of text I fed it (I used a short python script). It doesn't help that the book uses many archaic/uncommon words or uncommon spellings of words.
Does anyone have any ideas on how I might go about this? There are no dumb ideas here - any document format, machine learning technique, coding language, professional tool, etc that you can think of that might help is appreciated.
I should also note that I have substantial coding experience, and would not consider it a waste of time to build something myself. Tbh, I think it might be beneficial to the community, since I could not find such a tool for any Western language (French, Czech, Serbian, etc). Just need some guidance on how to get started.
What comes to my mind is a simple replacement. About 10% of the words are differentiated only by the diacritics, e.g. abandona and abandonă; those will not be fixed. But the other 90% will be fixed.
const dictUrl = 'https://raw.githubusercontent.com/ManiacDC/TypingAid/master/Wordlists/Wordlist%20Romanian.txt';

async function init(){
  console.log('init');
  // Download the Romanian word list and split it into words.
  const response = await fetch(dictUrl);
  const text = await response.text();
  console.log(`${text.length} characters`);
  const words = text.split(/\s+/mg);
  console.log(`${words.length} words`);
  // Build a map from the diacritic-stripped form to its diacritised spellings.
  const denormalize = {};
  let unique_count = 0;
  for(const w of words){
    // NFD separates base letters from combining marks; the replace keeps only a-z.
    const nw = w.normalize('NFD').replace(/[^a-z]/ig, '');
    if(!Object.hasOwnProperty.call(denormalize, nw)){
      denormalize[nw] = [];
      unique_count += 1;
    }
    denormalize[nw].push(w);
  }
  console.log(`${unique_count} unique normalized words`);
  for(const el of document.querySelectorAll('textarea')){
    handleSpellings(el, denormalize);
  }
}

function handleSpellings(el, dict){
  // After each space, replace every typed word with its first diacritised variant.
  el.addEventListener("keypress", function (e) {
    if(e.key == ' ')
      setTimeout(function () {
        const restored = el.value.replace(
          /\b\S+(?=[\x20-\x7f])/g,
          (s) => {
            const s2 = dict[s] ? dict[s][0] : s;
            console.log([s, dict[s], s2]);
            return s2;
          }
        );
        el.value = restored;
      }, 0);
  });
}

window.addEventListener('load', init);
<body>
<textarea style="width: 40em; height: 10em;">
</textarea>
</body>
Bob's answer is a static approach which will work depending on how good the word-list is.
So if a word is missing from this list it will never be handled.
Moreover, as in many other languages, there are cases where two (or more) words exist with the same characters but different diacritics.
For Romanian I found the following example: peste = over vs. pește = fish.
These cases cannot be handled in a straightforward way either.
This is especially an issue if the text you're converting contains words which aren't used anymore in today's language, especially diacritised ones.
In this answer I will present an alternative using machine learning.
The only caveat to this is that I couldn't find a publicly available trained model doing diacritic restoration for Romanian.
You may find some luck in contacting the authors of the papers I will mention here to see if they'd be willing to send their trained models for you to use.
Otherwise, you'll have to train your own model, which I'll give some pointers on.
I will try to give a comprehensive overview to get you started, but further reading is encouraged.
Although this process may be laborious, it can give you 99% accuracy with the right tools.
Language Model
The language model is a model which can be thought of as having a high-level "understanding" of the language.
It's typically pre-trained on raw text corpora.
Although you can train your own, be wary that these models are quite expensive to pre-train.
Whilst multilingual models can be used, language-specific models typically fare better if trained with enough data.
Luckily, there are publicly available language models for Romanian, such as RoBERT.
This language model is based on BERT, an architecture used extensively in Natural Language Processing & more or less the standard in the field, as it attained state-of-the-art results in English & other languages.
In fact there are three variants: base, large, & small.
The larger the model, the better the results, due to the larger representation power.
But larger models will also have a higher footprint in terms of memory.
Loading these models is very easy with the transformers library.
For instance, the base model:
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = AutoModel.from_pretrained("readerbench/RoBERT-base")
inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
outputs = model(**inputs)
The outputs above will contain vector representations of the input text, more commonly known as "word embeddings".
Language models are then fine-tuned to a downstream task — in your case, diacritic restoration — and would take these embeddings as input.
Fine-tuning
I couldn't find any publicly available fine-tuned models, so you'll have to fine-tune your own unless you manage to find one yourself.
To fine-tune a language model, we need to build a task-specific architecture which will be trained on some dataset.
The dataset is used to tell the model what the input looks like & what we'd like the output to be.
Dataset
From Diacritics Restoration using BERT with Analysis on Czech language, there's a publicly available dataset for a number of languages including Romanian.
The dataset annotations will also depend on which fine-tuning architecture you use (more on that below).
In general, you'd choose a dataset which you trust to have high-quality diacritics.
From this text you can then build annotations automatically by producing the undiacritised variants of the words as well as the corresponding labels.
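A minimal sketch of that pairing step, relying on the fact that Romanian diacritics map one-to-one onto their base letters:

import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD separates base letters from combining marks; drop the marks, recompose.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)

gold = "În fiecare zi mănânc pește."
pair = (strip_diacritics(gold), gold)       # (undiacritised input, gold label source)
print(pair)   # ('In fiecare zi mananc peste.', 'În fiecare zi mănânc pește.')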
Keep in mind that this or any other dataset you'll use will contain biases especially in terms of the domain the annotated texts originate from.
Depending on how much data you have already transcribed, you may also want to build a dataset using your texts.
Architecture
The architecture you choose will have a bearing on the downstream performance you get & the amount of custom code you'll have to write.
Word-level
The aforementioned work, Diacritics Restoration using BERT with Analysis on Czech language, uses a token-level classification mechanism where each word is labelled with a set of instructions indicating which type of diacritic mark to insert at which character index.
For example, the undiacritised word "dite" with instruction set 1:CARON;3:ACUTE indicates adding the appropriate diacritic marks at index 1 and index 3 to result in "dítě".
Since this is a token-level classification task, there's not much custom code you have to write, as you can directly use a BertForTokenClassification.
Refer to the authors' code for a more complete example.
One sidenote is that the authors use a multilingual language model.
This can be easily replaced with another language model such as RoBERT mentioned above.
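As a sketch of that swap (the num_labels value here is an assumption: it has to match the size of the instruction-set vocabulary you derive from your dataset):

from transformers import AutoTokenizer, BertForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = BertForTokenClassification.from_pretrained(
    "readerbench/RoBERT-base",
    num_labels=128,   # assumed; set to the number of instruction classes in your data
)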
Character-level
Alternatively, the RoBERT paper uses a character-level model.
From the paper, each character is annotated as one of the following:
make no modification to the current character (e.g., a → a), add circumflex mark (e.g., a → â and i → î), add breve mark (e.g., a → ă), and two more classes for adding comma below (e.g., s → ş and t → ţ)
Here you will have to build your own custom model (instead of the BertForTokenClassification above).
But, the rest of the training code will largely be the same.
Here's a template for the model class, built using the transformers library (the classification head and the label count filled in below are assumptions based on the character-level scheme above):
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BertForDiacriticRestoration(BertPreTrainedModel):
    def __init__(self, config, num_labels=5):
        super().__init__(config)
        self.bert = BertModel(config)
        # Assumption: 5 label classes, matching the character-level scheme above.
        self.classifier = nn.Linear(config.hidden_size, num_labels)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None):
        # One logit vector per (sub)token; mapping sub-tokens to characters is up to you.
        outputs = self.bert(input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        return self.classifier(outputs[0])
Evaluation
In each section there's a plethora of options for you to choose from.
A bit of pragmatic advice I'll offer is to start simple & complicate things if you want to improve things further.
Keep a testing set to measure if the changes you're making result in improvements or degradation over your previous setup.
Crucially, I'd suggest that at least a small part of your testing set comes from the texts you have transcribed yourself; the more you use the better.
Primarily, this is data you annotated yourself, so you are more sure of its quality than of any other publicly available source.
Secondly, when you are testing on data coming from the target domain, you stand a better chance of accurately evaluating your systems on your target task, given the biases which might be present in data from other domains.
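Since diacritic restoration never changes the text length (with precomposed characters), even a crude character-level accuracy over your held-out set goes a long way. A minimal sketch:

def char_accuracy(predicted: str, gold: str) -> float:
    assert len(predicted) == len(gold), "restoration should not change the length"
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

print(char_accuracy("In fiecare zi mananc peste.", "În fiecare zi mănânc pește."))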
So, I have a task where I need to measure the similarity between two texts. These texts are short descriptions of products from a grocery store. They always include a name of a product (for example, milk), and they may include a producer and/or size, and maybe some other characteristics of a product.
I have a whole set of such texts, and then, when a new one arrives, I need to determine whether there are similar products in my database and measure how similar they are (on a scale from 0 to 100%).
The thing is: the texts may be in two different languages: Ukrainian and Russian. Also, if there is a foreign brand (like, Coca Cola), it will be written in English.
My initial idea on solving this task was to get multilingual word embeddings (where similar words in different languages are located nearby) and find the distance between those texts. However, I am not sure how effective this will be, or, if it is OK, what to start with.
Because each text I have is just a set of product characteristics, word embeddings based on context may not work (I'm not sure about this statement, it is just my assumption).
So far, I have tried to get familiar with the MUSE framework, but I encountered an issue with faiss installation.
Hence, my questions are:
Is my idea with word embeddings worth trying?
Is there maybe a better approach?
If the idea with word embeddings is okay, which ones should I use?
Note: I have Windows 10 (in case some libraries don't work on Windows), and I need the library to work with Ukrainian and Russian languages.
Thanks in advance for any help! Any advice would be highly appreciated!
You could try Milvus, which uses Faiss under the hood to search for similar vectors. It's easy to install with Docker on Windows.
Word embeddings are meaningful within a language but are not transferable to other languages. An observation behind this statement: if two words co-occur a lot inside sentences, their embeddings end up near each other. Hence, as there is no one-to-one mapping between two general languages, you cannot compare word embeddings directly.
However, if two languages are similar enough for a one-to-one mapping of words, your idea may work.
In sum, without translation, your idea is not applicable to two arbitrary languages.
Does the data contain lots of numerical information (e.g. nutritional facts)? If yes, this could be used to compare the products to some extent. My advice is to think of it not as a linguistic problem, but as pattern matching, as these texts have presumably been produced using semi-automatic methods and translation memories. Therefore similar texts across languages may have a similar form, and if so, this should be used for comparison.
Multilingual text comparison is not a trivial task and I don't think there are any reasonably good out-of-box solutions for that. Yes, multilingual embeddings exist, but they have to be fine-tuned to work on specific downstream tasks.
Let's say that your task is about fine-grained entity recognition. I think you have well-defined entities: brand, size, etc.
So, each of these features that define a product could be a vector, which means a product could be represented with a matrix.
You can potentially represent each feature with an embedding.
Or mixture of the embedding and one-hot vectors.
Here is how.
Define a list of product features:
product name, brand name, size, weight.
For each product feature, you need a text recognition model:
E.g. with brand recognition you find what part of the text is its brand name.
Use machine translation, if possible, to make a unified language representation for all sub-texts. E.g. map Coca Cola to ru Кока-Кола and en Coca Cola.
Use contextual embeddings (e.g. Hugging Face's multilingual BERT or something better) to convert the prompted text into one vector.
In order to compare two products, compare their feature vectors: e.g. take the average similarity between the two feature arrays. You can also decide what weight each feature gets.
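A hedged sketch of the embedding-and-comparison part, assuming the per-feature texts have already been extracted and using bert-base-multilingual-cased with mean pooling (any multilingual encoder could be substituted; the weights are arbitrary):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(text):
    # Mean-pool the last hidden state into a single vector for the text.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

def cosine(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# Hypothetical per-feature texts, already extracted from two product descriptions.
product_a = {"name": "Кока-Кола", "size": "0.5 л"}
product_b = {"name": "Coca Cola", "size": "0,5 l"}
weights   = {"name": 0.8, "size": 0.2}

score = sum(weights[f] * cosine(embed(product_a[f]), embed(product_b[f])) for f in weights)
print(f"weighted similarity: {score:.2f}")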
Try other vectorization methods. Perhaps you don't want to mix up brand knockoffs: "Coca Cola" is similar to "Cool Cola". So maybe embeddings aren't good for brand names, size, and weight, but are good enough for product names. If you want an exact match, you need a hash function on their text, i.e. on their multilingual, prompt-engineered text.
You can also extend each feature vector with concatenations of several embeddings, or a one-hot vector of the source language, and things like that.
There is no definitive answer here; you need to experiment and test to see what the best solution is. You can create a test set and benchmark your solutions.
I'm in need of suggestions on how to extract keywords from a large document. The keywords should be in line with what we have defined as the intended search results.
For example,
Given a document about a company, I need the owner's name, where the office is situated, and what the operating industry is, and the defined set of words would be,
{owner, director, office, industry...}-(1)
the intended output has to be something like,
{Mr. Smith James, Main Street, Financial Banking}-(2)
I was looking for a method related to Semantic Similarity, where sentences containing words similar to the given set (1) would be extracted, and POS tagging would then be used to extract nouns from those sentences.
It would be useful if further resources that support this approach could be provided.
What you want to do is referred to as Named Entity Recognition.
In Python there is a popular library called SpaCy that can be used for that. The standard models are able to detect 18 different entity types which is a fairly good amount.
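A minimal usage sketch, assuming the small English pipeline has been installed with python -m spacy download en_core_web_sm (the example text is made up):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mr. Smith James founded the company on Main Street, "
          "operating in financial banking.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # entity types such as PERSON, ORG, GPE, ...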
Persons and company names should be extracted easily, while whole addresses and the industry might be more difficult. Maybe you would have to train your own model on these entity types. SpaCy also provides an API for training your own models.
Please note that you need quite a lot of training data to get decent results. Start with 1000 examples per entity type and see if it's sufficient for your needs. POS can be used as a feature.
If your data is unstructured, this is probably one of most suited approaches. If you have more structured data, you could maybe take advantage of that.
The context is: I already have clusters of words (phrases actually) resulting from k-means applied to internet search queries, using common URLs in the search engine results as a distance (co-occurrence of URLs rather than words, if I simplify a lot).
I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together.
For example - sorry for the subject of my example - if I have the following bunch of queries : ['my husband attacked me','he was arrested by the police','the trial is still going on','my husband can go to jail for harrassing me ?','free lawyer']
My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem so the label could be "legal" for example.
I am new to NLP, but I should make clear that I don't want to extract words using POS tagging (or at least this is not the expected final outcome, though it may be a necessary preliminary step).
I read about WordNet for sense disambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input), nor obtain the definition of one selected word from the context provided by the whole bunch of words (which word would I select in that case?). I want to use the whole bunch of words to provide a context (maybe using synsets or categorization with the XML structure of WordNet) and then summarize the context in one or a few words.
Any ideas? I can use R or Python; I read a little about NLTK but I can't find a way to use it in my context.
Your best bet is probably to label the clusters manually, especially if there are few of them. This is a difficult problem even for humans to solve, because you might need a domain expert. Anyone claiming they could do that automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.
Also, going through the clusters yourself will have benefits. 1) you may discover you had the wrong number of clusters (the k parameter) or that there was too much junk in the input to begin with. 2) you will gain qualitative insight into what is being talked about and what topics there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need a quantitative result too, you could then train a classifier on the manually labelled topics to 1) predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, ...
When we talk about semantics in this area we mean Statistical Semantics. Statistical or distributional semantics is very different from other definitions of semantics which have logic and reasoning behind them. Statistical semantics is based on the Distributional Hypothesis, which treats context as the meaning-bearing aspect of words and phrases. Meaning, in this very abstract and general sense, is in the literature called topics. There are several unsupervised methods for modelling topics, such as LDA or even word2vec, which basically provide a word similarity metric or suggest a list of similar words for a document as another kind of context. Usually when you have these unsupervised clusters, you need a domain expert to tell the meaning of each cluster.
However, for several reasons you might accept a low-accuracy assignment of a word as the general topic (or, in your words, the "global semantic") of a list of phrases. If this is the case, I would suggest taking a look at Word Sense Disambiguation tasks which look for coarse-grained word senses. For WordNet, this is known as the supersense tagging task.
This paper is worth a look: More or less supervised supersense tagging of Twitter
And regarding your question about choosing words from the current phrases, there is also an active question about "converting a phrase to a vector"; my word2vec-style answer to that question might be useful:
How can a sentence or a document be converted to a vector?
I can add more related papers later if it comes to my mind.
The paper Automatic Labelling of Topic Models explains the authors' approach to this problem. To provide an overview I can tell you that they generate some label candidates using information retrieved from Wikipedia and Google, and once they have the list of candidates in place they rank those candidates to find the best label.
I think the code is not available online, but I have not looked for it.
The package chowmein claims to do this in python using the algorithm outlined in Automatic Labeling of Multinomial Topic Models.
One possible approach, which the papers below suggest, is identifying the set of keywords from the cluster, getting all their synonyms and then finding the hypernyms for each synonym.
The idea is to get a more abstract meaning for the cluster by using the hypernym.
Example: a word cluster containing the words dog and wolf should not be labelled with either word, but as canids. They achieve this using synonymy and hypernymy (a minimal WordNet sketch follows the paper titles below).
Cluster Labeling by Word Embeddings and WordNet's Hypernymy
Automated Text Clustering and Labeling using Hypernyms
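A minimal NLTK/WordNet sketch of the shared-hypernym idea from the dog/wolf example above; it naively takes the first noun sense of each word, which a real system would have to disambiguate:

from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

words = ["dog", "wolf"]
hypernym_sets = []
for w in words:
    synset = wn.synsets(w, pos=wn.NOUN)[0]          # naive: first noun sense
    hypernym_sets.append({h.name() for h in synset.hypernyms()})

# A hypernym shared by all cluster members is a candidate label.
print(set.intersection(*hypernym_sets))   # e.g. {'canine.n.02'}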
First: Any recs on how to modify the title?
I am using my own named entity recognition algorithm to parse data from plain text. Specifically, I am trying to extract lawyer practice areas. A common sentence structure that I see is:
1) Neil focuses his practice on employment, tax, and copyright litigation.
or
2) Neil focuses his practice on general corporate matters including securities, business organizations, contract preparation, and intellectual property protection.
My entity extraction is doing a good job of finding the key words, for example, my output from sentence one might look like this:
Neil focuses his practice on (employment), (tax), and (copyright litigation).
However, that doesn't really help me. What would be more helpful is if I got an output that looked more like this:
Neil focuses his practice on (employment - litigation), (tax - litigation), and (copyright litigation).
Is there a way to accomplish this goal using an existing Python framework such as NLTK? After my algorithm extracts the practice areas, can I use NLTK to extract the other words that my "practice areas" modify, in order to get a more complete picture?
Named entity recognition (NER) systems typically use grammar-based rules or statistical language models. What you have described here seems to be based only on keywords, though.
Typically, and much like most complex NLP tasks, NER systems should be trained on domain-specific data so that they perform well on previously unseen (test) data. You will require adequate knowledge of machine learning to go down that path.
In "normal" language, if you want to extract words or phrases and categorize them into classes defined by you (e.g. litigation), if often makes sense to use category labels in external ontologies. An example could be:
You want to extract words and phrases related to sports.
Such a categorization (i.e. detecting whether or not a word is indeed related to sports) is not a "general"-enough problem, which means you will not find ready-made systems that solve it (e.g. algorithms in the NLTK library). You can, however, use an ontology like Wikipedia and exploit the category labels available there.
E.g., if you search Wikipedia for "football", you can check that it has the category label "ball games", which in turn is under "sports".
Note that the Wikipedia category labels form a directed graph. If you build a system which exploits the category structure of such an ontology, you should be able to categorize terms in your texts as you see fit. Moreover, you can even control the granularity of the categorization (e.g. do you want just "sports", or "individual sports" and "team sports").
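As a small sketch of pulling those category labels programmatically (assuming the third-party wikipedia package; the exact labels returned may differ from the example above):

import wikipedia   # pip install wikipedia

page = wikipedia.page("Football", auto_suggest=False)
print(page.categories[:10])   # raw category labels to walk upward from, e.g. towards "Sports"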
I have built such a system for categorizing terms related to computer science, and it worked remarkably well. The closest freely available system that works in a similar way is the Wikifier built by the cognitive computing group at the University of Illinois at Urbana-Champaign.
Caveat: You may need to tweak simple category-based code to suit your needs. E.g. there is no Wikipedia page for "litigation". Instead, it redirects you to a page titled "lawsuit". Such cases need to be handled separately.
Final Note: This solution is really not in the area of NLP, but my past experience suggests that for some domains, this kind of ontology-based approach works really well. Also, I have used the "sports" example in my answer because I know nothing about legal terminology. But I hope my example helps you understand the underlying process.
I do not think your "algo" is even doing entity recognition... however, stretching the problem you presented quite a bit, what you want to do looks like coreference resolution in coordinated structures containing ellipsis. Not easy at all: start by googling for some relevant literature in linguistics and computational linguistics. I use the standard terminology from the field below.
In practical terms, you could start by assigning the nearest antecedent (the most frequently used approach in English). Using your examples:
first extract all the "entities" in a sentence
from the entity list, identify antecedent candidates ("litigation", etc.). This is a very difficult task, involving many different problems... you might avoid it if you know in advance the "entities" that will be interesting for you.
finally, you assign (resolve) each anaphora/cataphora to the nearest antecedent.
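A deliberately naive sketch of that last resolution step, assuming the coordinated entities are already extracted in order and that the shared head noun appears only on the final conjunct (real coordination/ellipsis resolution is much harder, as noted above):

def share_head(entities):
    # Distribute the head noun of the last conjunct over the earlier ones.
    head = entities[-1].split()[-1]                 # e.g. "litigation"
    return [e if e.endswith(head) else f"{e} {head}" for e in entities]

print(share_head(["employment", "tax", "copyright litigation"]))
# ['employment litigation', 'tax litigation', 'copyright litigation']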
Have a look at CogComp NER tagger:
https://github.com/CogComp/cogcomp-nlp/tree/master/ner