How to generate homophones on substring level?

How to generate homophones on substring level? - python

I want to generate homophones of words programmatically. Meaning, words that sound similar to the original words.
I've come across the Soundex algorithm, but it just replaces some characters with other characters (like t instead of d). Are there any lists or algorithms that are a little bit more sophisticated, providing at least homophone substrings?
Important: I want to apply this on words that aren't in dictionaries, meaning that I can't rely on whole, real words.
EDIT:
The input is a string which is often a proper name and therefore in no standard (homophone) dictionary. An example could be Google or McDonald's (just to name two popular named entities, but many are much more unpopular).
The output is then a (random) homophone of this string. Since words often have more than one homophone, a single (random) one is my goal. In the case of Google, a homophone could be gugel, or MacDonald's for McDonald's.

How to do this well is a research topic. See for example http://www.inf.ufpr.br/didonet/articles/2014_FPSS.pdf.
But suppose that you want to roll your own.
The first step is figuring out how to turn the letters that you are given into a representation of what it sounds like. This is a very hard problem with guessing required. (eg What sound does "read" make? Depends on whether you are going to read, or you already read!) However text to phonemes converter suggests that Arabet has solved this for English.
Next you'll want this to have been done for every word in a dictionary. Assuming that you can do that for one word, that's just a script.
Then you'll want it stored in a data structure where you can easily find similar sounds. That is in principle no difference than the sort of algorithms that are used for autocorrect for spelling. Only with phonemes instead of letters. You can get a sense of how to do that with http://norvig.com/spell-correct.html. Or try to implement something like what is described in http://fastss.csg.uzh.ch/ifi-2007.02.pdf.
And that is it.

Related

How to find all words with any permutation of the given letters in SQL?

I am working with the sqlite3 module, using Python 3.10.0. I have created a database with a table of English words, where one of the columns is creatively named "word". My question is, how can I sample all the words that contain at most the letters within the given word? For example, if the input was "establishment", valid outputs could be "meant", "tame", "mate", "team", "establish", "neat", and so on. Invalid inputs consist of words with any other letters other than those found within the input. I have done some research on this, but the only thing I found which even comes close to this is using the LIKE keyword, which seems to be a limited version of regular expression matching. I mentioned using Python 3.10 because I think I read somewhere that sqlite3 supports user-defined functions, but I figured I'd ask first to see if somebody knows an easier solution.

Your question is extremely vague.
Let me answer a related question: "How may I efficiently find anagrams of a given word?"
There is a standard approach to this.
Simply alphabetize all letters within a word, and store them in sorted order.
So given a dictionary containing these "known" words,
we would have the first three map to the same string:
pale <--> aelp
peal <--> aelp
plea <--> aelp
plan <--> alnp
Now given a query word of "leap", how shall we efficiently find its anagrams?
Turn it into "aelp".
Query for that string, retrieving three matching dictionary words.
Sqlite is an excellent fit for such a task.
It can easily produce suitable column indexes.
Now let's return to your problem.
I suspect it's a bit more complex than anagrams.
Consider using a related approach.
Rip through each dictionary word, storing digrams in standard order.
So for "pale", we would store:
pale <--> ap
pale <--> al
pale <--> el
Repeat for all other dictionary words.
Then, at query time, given an input of "leap",
you might consult the database for "el", "ae", and "ap".
Notice that "ae" missed, there.
If that troubles you, when processing the whole dictionary
feel free to store all 2-letter combinations, even ones that aren't consecutive.
Possibly going to trigrams, or all 3-letter combinations, would prove helpful.
Spend some time working with the problem to find out.

Python3 - Doc2Vec: Get document by vector/ID

I've already built my Doc2Vec model, using around 20.000 files. I'm looking for a way to find the string representation of a given vector/ID, which might be similar to Word2Vec's index2entity. I'm able to get the vector itself, using model['n'], but now I'm wondering whether there's a way to get some sort of string representation of it as well.

If you want to look up your actual training text, for a given text+tag that was part of training, you should retain that mapping outside the Doc2Vec model. (The model doesn't store training texts – only looking at them, repeatedly, during training.)
If you want to generate a text from a Doc2Vec doc-vector, that's not an existing feature, nor do I know any published work describing a reliable technique for doing so.
There's a speculative/experimental bit of work-in-progress for gensim Doc2Vec that will forward-propagate a doc-vector through the model's neural-network, and report back the most-highly-predicted target words. (This is somewhat the opposite of the way infer_vector() works.)
That might, plausibly, give a sort-of summary text. For more details see this open issue & the attached PR-in-progress:
https://github.com/RaRe-Technologies/gensim/issues/2459
Whether this is truly useful or likely to become part of gensim is still unclear.
However, note that such a set-of-words wouldn't be grammatical. (It'll just be the ranked-list of most-predicted words. Perhaps some other subsystem could try to string those words together in a natural, grammatical way.)
Also, the subtleties of whether a concept has many potential associates words, or just one, could greatly affect the "top N" results of such a process. Contriving a possible example: there are many words for describing a 'cold' environment. As a result, a doc-vector for a text about something cold might have lots of near-synonyms for 'cold' in the 11th-20th ranked positions – such that the "total likelihood" of at least one cold-ish word is very high, maybe higher than any one other word. But just looking at the top-10 most-predicted words might instead list other "purer" words whose likelihood isn't so divided, and miss the (more-important-overall) sense of "coldness". So, this experimental pseudo-summarization method might benefit from a second-pass that somehow "coalesces" groups-of-related-words into their most-representative words, until some overall proportion (rather than fixed top-N) of the doc-vector's predicted-words are communicated. (This process might be vaguely like finding a set of M words whose "Word Mover's Distance" to the full set of predicted-words is minimized – though that could be a very expensive search.)

Find most SIMILAR sentence/string to a reference one in text corpus in python

my goal is very simple: I have a set of strings or a sentence and I want to find the most similar one within a text corpus.
For example I have the following text corpus: "The front of the library is adorned with the Word of Life mural designed by artist Millard Sheets."
And I'd like to find the substring of the original corpus which is most similar to: "the library facade is painted"
So what I should get as output is: "fhe front of the library is adorned"
The only thing I came up with is to split the original sentence in substrings of variable lengths (eg. in substrings of 3,4,5 strings) and then use something like string.similarity(substring) from the spacy python module to assess the similarities of my target text with all the substrings and then keep the one with the highest value.
It seems a pretty inefficient method. Is there anything better I can do?

It probably works to some degree, but I wouldn't expect the spacy similarity method (averaging word vectors) to work particularly well.
The task you're working on is related to paraphrase detection/identification and semantic textual similarity and there is a lot of existing work. It is frequently used for things like plagiarism detection and the evaluation of machine translation systems, so you might find more approaches by looking in those areas, too.
If you want something that works fairly quickly out of the box for English, one suggestion is terp, which was developed for MT evaluation but shown to work well for paraphrase detection:
https://github.com/snover/terp
Most methods are set up to compare two sentences, so this doesn't address your potential partial sentence matches. Maybe it would make sense to find the most similar sentence and then look for substrings within that sentence that match better than the sentence as a whole?

Negation handling in NLP

I'm currently working on a project, where I want to extract emotion from text. As I'm using conceptnet5 (a semantic network), I can't however simply prefix words in a sentence that contains a negation-word, as those words would simply not show up in conceptnet5's API.
Here's an example:
The movie wasn't that good.
Hence, I figured that I could use wordnet's lemma functionality to replace adjectives in sentences that contain negation-words like (not, ...).
In the previous example, the algorithm would detect wasn't and would replace it with was not.
Further, it would detect a negation-word not, and replace good with it's antonym bad.
The sentence would read:
The movie was that bad.
While I see that this isn't the most elegant way, and it does probably in many cases produce the wrong result, I'd still like to handle negation that way as I frankly don't know any better approach.
Considering my problem:
Unfortunately, I did not find any library that would allow me to replace all occurrences of appended negation-words (wasn't => was not).
I mean I could do it manually, by replacing the occurrences with a regex, but then I would be stuck with the english language.
Therefore I'd like to ask if some of you know a library, function or better method that could help me here.
Currently I'm using python nltk, still it doesn't seem that it contains such functionality, but I may be wrong.
Thanks in advance :)

Cases like wasn't can be simply parsed by tokenization (tokens = nltk.word_tokenize(sentence)): wasn't will turn into was and n't.
But negative meaning can also be formed by 'Quasi negative words, like hardly, barely, seldom' and 'Implied negatives, such as fail, prevent, reluctant, deny, absent', look into this paper. Even more detailed analysis can be found in Christopher Potts' On the negativity of negation
.
Considering your initial problem, sentiment analysis, most modern approaches, as far as I know, don't process negations explicitly; instead, they use supervised approaches with high-order n-grams. Those actually processing negation usually append special prefix NOT_ to all words between negation and punctuation marks.

Creating a list of words from Wikipedia

I am creating a game and I need a dictionary (a list of plain words in this case) containing not only the base form, but all the others as well. In this case the language is Italian and, for example, the verbs have many forms and nouns too.
Since the language is very irregular, I want to get the words from a huge source which may contain them all. At first I thought about Wikipedia: I would download every article, extract the text, and filter the words.
This will take so much time that I'd like to know whether there could be better solutions, both in terms of time and completeness of the list.

If you're on a Linux system you might want to look in /usr/share/dict/words.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.