How to automatically detect words in a string without spaces [duplicate] - python

Input: "tableapplechairtablecupboard..." many words
What would be an efficient algorithm to split such text to the list of words and get:
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
First thing that comes to mind is to go through all possible words (starting with the first letter) and find the longest word possible, then continue from position=word_position+len(word)
P.S.
We have a list of all possible words.
Word "cupboard" can be "cup" and "board", select longest.
Language: python, but main thing is the algorithm itself.

A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-world text.
(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)
The idea
The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.
Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it's easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.
The code
from math import log
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))
which you can use with
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
The results
I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.
Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.
Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen
odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho
rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery
whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.
After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.
Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.
After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.
As you can see it is essentially flawless. The most important part is to make sure your word list was trained on a corpus similar to what you will actually encounter, otherwise the results will be very bad.
Optimization
The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.
If you need to process a very large consecutive string it would be reasonable to split the string to avoid excessive memory usage. For example you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will have almost certainly no effect on the quality.
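As a rough sketch of that block-wise scheme (a hypothetical wrapper around infer_spaces above; in rare cases the two overlapping windows can disagree about a word that straddles a block boundary, so treat it as an approximation):

def infer_spaces_chunked(s, block=10000, margin=1000):
    # Segment a very long string block by block, re-using infer_spaces().
    out = []
    for start in range(0, len(s), block):
        lo = max(0, start - margin)
        hi = min(len(s), start + block + margin)
        words = infer_spaces(s[lo:hi]).split()
        # Keep only the words that *start* inside the central block; the
        # margins exist purely to give the DP context around the edges.
        pos = lo
        for w in words:
            if start <= pos < start + block:
                out.append(w)
            pos += len(w)
    return " ".join(out)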

Based on the excellent work in the top answer, I've created a pip package for easy use.
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
To install, run pip install wordninja.
The only differences are minor: it returns a list rather than a str, it works in Python 3, it includes the word list, and it properly splits even if there are non-alpha chars (like underscores, dashes, etc.).
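For example, underscores act the same way as any other non-alpha separator, so this behaves just like the call above (illustrative; the split itself still depends on the bundled word list):

>>> wordninja.split('derek_anderson')
['derek', 'anderson']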
Thanks again to Generic Human!
https://github.com/keredson/wordninja

Here is a solution using recursive search:
def find_words(instring, prefix = '', words = None):
    if not instring:
        return []
    if words is None:
        words = set()
        with open('/usr/share/dict/words') as f:
            for line in f:
                words.add(line.strip())
    if (not prefix) and (instring in words):
        return [instring]
    prefix, suffix = prefix + instring[0], instring[1:]
    solutions = []
    # Case 1: prefix in solution
    if prefix in words:
        try:
            solutions.append([prefix] + find_words(suffix, '', words))
        except ValueError:
            pass
    # Case 2: prefix not in solution
    try:
        solutions.append(find_words(suffix, prefix, words))
    except ValueError:
        pass
    if solutions:
        return sorted(solutions,
                      key = lambda solution: [len(word) for word in solution],
                      reverse = True)[0]
    else:
        raise ValueError('no solution')

print(find_words('tableapplechairtablecupboard'))
print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))
yields
['table', 'apple', 'chair', 'table', 'cupboard']
['tab', 'leprechaun']

Using a trie data structure, which holds the list of possible words, it would not be too complicated to do the following (a short sketch of this walk follows the list):
1. Advance the pointer (in the concatenated string)
2. Look up and store the corresponding node in the trie
3. If the trie node has children (i.e. there are longer words), go to 1.
4. If the node reached has no children, a longest word match happened; add the word (stored in the node or just concatenated during trie traversal) to the result list, reset the pointer in the trie (or reset the reference), and start over
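A minimal sketch of that walk, using a plain dict-based trie (note that, as described, it is greedy with no backtracking, so it can fail on inputs like "tableprechaun" where the longest prefix is not the right choice):

def build_trie(words):
    # Each node is a dict: {"children": {char: node}, "is_word": bool}.
    root = {"children": {}, "is_word": False}
    for w in words:
        node = root
        for ch in w:
            node = node["children"].setdefault(ch, {"children": {}, "is_word": False})
        node["is_word"] = True
    return root

def greedy_trie_split(s, root):
    words, start = [], 0
    while start < len(s):
        node, i, last_end = root, start, None
        while i < len(s) and s[i] in node["children"]:
            node = node["children"][s[i]]
            i += 1
            if node["is_word"]:
                last_end = i              # longest word found from `start` so far
        if last_end is None:
            raise ValueError("no dictionary word starts at position %d" % start)
        words.append(s[start:last_end])
        start = last_end                  # reset the trie pointer and start over
    return words

trie = build_trie(["table", "apple", "chair", "cup", "board", "cupboard"])
print(greedy_trie_split("tableapplechairtablecupboard", trie))
# ['table', 'apple', 'chair', 'table', 'cupboard']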

The answer by Generic Human is great. But the best implementation of this I've ever seen was written by Peter Norvig himself in his book 'Beautiful Data'.
Before I paste his code, let me expand on why Norvig's method is more accurate (although a little slower and longer in terms of code).
The data is a bit better - both in terms of size and in terms of precision (he uses a word count rather than a simple ranking)
More importantly, it's the logic behind n-grams that really makes the approach so accurate.
The example he provides in his book is the problem of splitting the string 'sitdown'. A non-bigram method of string splitting would consider p('sit') * p('down'), and if this is less than p('sitdown') - which will be the case quite often - it will NOT split it, but we'd want it to (most of the time).
However, when you have the bigram model you can value p('sit down') as a bigram vs p('sitdown'), and the former wins. Basically, if you don't use bigrams, the splitter treats the probabilities of the words as independent, which is not the case: some words are more likely to appear one after the other. Unfortunately those are also the words that are often stuck together, and they confuse the splitter.
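To make the arithmetic concrete, here is the same comparison with purely hypothetical numbers (only the relative sizes matter; none of these probabilities come from real counts):

# Hypothetical probabilities, for illustration only.
P1 = {'sit': 1e-4, 'down': 2e-4, 'sitdown': 5e-8}   # unigram model
P_sit_down = 3e-5                                   # bigram P('sit down')

# Independence (unigram) model: P(sit)*P(down) = 2e-8 loses to P(sitdown) = 5e-8,
# so the splitter keeps 'sitdown' glued together.
print(P1['sit'] * P1['down'] < P1['sitdown'])   # True

# Bigram model: P('sit down') = 3e-5 beats P('sitdown') = 5e-8 by a wide margin,
# so the split wins.
print(P_sit_down > P1['sitdown'])               # True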
Here's the link to the data (it's data for 3 separate problems and segmentation is only one. Please read the chapter for details): http://norvig.com/ngrams/
and here's the link to the code: http://norvig.com/ngrams/ngrams.py
These links have been up a while, but I'll copy-paste the segmentation part of the code here anyway:
import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229  ## Number of tokens

Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo
def segment2(text, prev='<S>'):
    "Return (log P(words), words), where words is the best segmentation."
    if not text: return 0.0, []
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first))
                  for first,rem in splits(text)]
    return max(candidates)

def combine(Pfirst, first, (Prem, rem)):
    "Combine first and rem results into one (probability, words) pair."
    return Pfirst+Prem, [first]+rem
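For reference, with count_1w.txt and count_2w.txt from norvig.com/ngrams sitting next to the script, the two segmenters are called like this under Python 2 (which the code above requires); 'choosespain' is one of the chapter's own examples:

print segment('choosespain')    # expected: ['choose', 'spain']
print segment2('choosespain')   # expected: (log10 probability, ['choose', 'spain'])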

Unutbu's solution was quite close, but I find the code difficult to read, and it didn't yield the expected result. Generic Human's solution has the drawback that it needs word frequencies, which is not appropriate for every use case.
Here's a simple solution using a Divide and Conquer algorithm.
It tries to minimize the number of words. E.g., find_words('cupboard') will return ['cupboard'] rather than ['cup', 'board'] (assuming that cupboard, cup and board are in the dictionary).
The optimal solution is not unique; the implementation below returns one solution. find_words('charactersin') could return ['characters', 'in'] or maybe it will return ['character', 'sin'] (as seen below). You could quite easily modify the algorithm to return all optimal solutions (a sketch of that follows the output below).
In this implementation solutions are memoized so that it runs in a reasonable time.
The code:
words = set()
with open('/usr/share/dict/words') as f:
    for line in f:
        words.add(line.strip())

solutions = {}
def find_words(instring):
    # First check if instring is in the dictionary
    if instring in words:
        return [instring]
    # No... But maybe it's a result we already computed
    if instring in solutions:
        return solutions[instring]
    # Nope. Try to split the string at all positions to recursively search for results
    best_solution = None
    for i in range(1, len(instring) - 1):
        part1 = find_words(instring[:i])
        part2 = find_words(instring[i:])
        # Both parts MUST have a solution
        if part1 is None or part2 is None:
            continue
        solution = part1 + part2
        # Is the solution found "better" than the previous one?
        if best_solution is None or len(solution) < len(best_solution):
            best_solution = solution
    # Remember (memoize) this solution to avoid having to recompute it
    solutions[instring] = best_solution
    return best_solution
This will take about 5 seconds on my 3 GHz machine:
result = find_words("thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot")
assert(result is not None)
print ' '.join(result)
the reis masses of text information of peoples comments which is parsed from h t m l but there are no delimited character sin them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple e t c in the string i also have a large dictionary to query whether the word is reasonable so whats the fastest way of extraction t h x a lot
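As the answer notes, returning every minimal split instead of just one is a small change. A sketch (find_all_best is a hypothetical name, not part of the original code):

def find_all_best(instring, words, cache=None):
    """Return every split of `instring` that uses the minimal number of
    dictionary words, or [] if no split exists."""
    if cache is None:
        cache = {}
    if instring in cache:
        return cache[instring]
    candidates = [[instring]] if instring in words else []
    for i in range(1, len(instring)):
        head = instring[:i]
        if head not in words:
            continue
        for tail in find_all_best(instring[i:], words, cache):
            candidates.append([head] + tail)
    if candidates:
        best = min(len(c) for c in candidates)
        candidates = [c for c in candidates if len(c) == best]
    cache[instring] = candidates
    return candidates

print(find_all_best("cupboard", {"cup", "board", "cupboard"}))   # [['cupboard']]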

Here is the accepted answer translated to JavaScript (requires node.js, and the file "wordninja_words.txt" from https://github.com/keredson/wordninja):
var fs = require("fs");

var splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
var maxWordLen = 0;
var wordCost = {};

fs.readFile("./wordninja_words.txt", 'utf8', function(err, data) {
    if (err) {
        throw err;
    }
    var words = data.split('\n');
    words.forEach(function(word, index) {
        wordCost[word] = Math.log((index + 1) * Math.log(words.length));
    })
    words.forEach(function(word) {
        if (word.length > maxWordLen)
            maxWordLen = word.length;
    });
    console.log(maxWordLen)
    splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
    console.log(split(process.argv[2]));
});

function split(s) {
    var list = [];
    s.split(splitRegex).forEach(function(sub) {
        _split(sub).forEach(function(word) {
            list.push(word);
        })
    })
    return list;
}

module.exports = split;

function _split(s) {
    var cost = [0];

    function best_match(i) {
        var candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
        var minPair = [Number.MAX_SAFE_INTEGER, 0];
        candidates.forEach(function(c, k) {
            if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
                var ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
            } else {
                var ccost = Number.MAX_SAFE_INTEGER;
            }
            if (ccost < minPair[0]) {
                minPair = [ccost, k + 1];
            }
        })
        return minPair;
    }

    for (var i = 1; i < s.length + 1; i++) {
        cost.push(best_match(i)[0]);
    }

    var out = [];
    i = s.length;
    while (i > 0) {
        var c = best_match(i)[0];
        var k = best_match(i)[1];
        if (c == cost[i])
            console.log("Alert: " + c);

        var newToken = true;
        if (s.slice(i - k, i) != "'") {
            if (out.length > 0) {
                if (out[-1] == "'s" || (Number.isInteger(s[i - 1]) && Number.isInteger(out[-1][0]))) {
                    out[-1] = s.slice(i - k, i) + out[-1];
                    newToken = false;
                }
            }
        }
        if (newToken) {
            out.push(s.slice(i - k, i))
        }
        i -= k
    }
    return out.reverse();
}

If you precompile the wordlist into a DFA (which will be very slow), then the time it takes to match an input will be proportional to the length of the string (in fact, only a little slower than just iterating over the string).
This is effectively a more general version of the trie algorithm which was mentioned earlier. I only mention it for completeness -- as of yet, there's no DFA implementation you can just use. RE2 would work, but I don't know if the Python bindings let you tune how large you allow a DFA to be before it just throws away the compiled DFA data and does NFA search.
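For illustration only (Python's built-in re engine is a backtracking matcher rather than a true DFA, so this only shows the "precompile the word list" interface without the linear-time guarantee, and the greedy longest-alternative-first match has the same failure modes as the trie approach above):

import re

vocabulary = ["table", "apple", "chair", "cup", "board", "cupboard"]   # example word list
# Longest alternatives first so "cupboard" is preferred over "cup".
pattern = re.compile("|".join(sorted(map(re.escape, vocabulary), key=len, reverse=True)))

def greedy_regex_split(s):
    words, pos = [], 0
    while pos < len(s):
        m = pattern.match(s, pos)          # match anchored at the current position
        if not m:
            raise ValueError("no word at position %d" % pos)
        words.append(m.group(0))
        pos = m.end()
    return words

print(greedy_regex_split("tableapplechairtablecupboard"))
# ['table', 'apple', 'chair', 'table', 'cupboard']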

This will help
from wordsegment import load, segment
load()
segment('providesfortheresponsibilitiesofperson')
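segment returns a plain list of lower-cased words; on this input I would expect something close to:

['provides', 'for', 'the', 'responsibilities', 'of', 'person']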

Many thanks for the help in https://github.com/keredson/wordninja/
A small contribution of the same in Java from my side.
The public method splitContiguousWords can be embedded in a class together with the other two methods below, with ninja_words.txt in the same directory (or modified as per the coder's choice). The method splitContiguousWords can then be used for the purpose.
public List<String> splitContiguousWords(String sentence) {
    String splitRegex = "[^a-zA-Z0-9']+";
    Map<String, Number> wordCost = new HashMap<>();
    List<String> dictionaryWords = IOUtils.linesFromFile("ninja_words.txt", StandardCharsets.UTF_8.name());
    double naturalLogDictionaryWordsCount = Math.log(dictionaryWords.size());
    long wordIdx = 0;
    for (String word : dictionaryWords) {
        wordCost.put(word, Math.log(++wordIdx * naturalLogDictionaryWordsCount));
    }
    int maxWordLength = Collections.max(dictionaryWords, Comparator.comparing(String::length)).length();
    List<String> splitWords = new ArrayList<>();
    for (String partSentence : sentence.split(splitRegex)) {
        splitWords.add(split(partSentence, wordCost, maxWordLength));
    }
    log.info("Split word for the sentence: {}", splitWords);
    return splitWords;
}

private String split(String partSentence, Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> cost = new ArrayList<>();
    cost.add(new Pair<>(Integer.valueOf(0), Integer.valueOf(0)));
    for (int index = 1; index < partSentence.length() + 1; index++) {
        cost.add(bestMatch(partSentence, cost, index, wordCost, maxWordLength));
    }
    int idx = partSentence.length();
    List<String> output = new ArrayList<>();
    while (idx > 0) {
        Pair<Number, Number> candidate = bestMatch(partSentence, cost, idx, wordCost, maxWordLength);
        Number candidateCost = candidate.getKey();
        Number candidateIndexValue = candidate.getValue();
        if (candidateCost.doubleValue() != cost.get(idx).getKey().doubleValue()) {
            throw new RuntimeException("Candidate cost unmatched; This should not be the case!");
        }
        boolean newToken = true;
        String token = partSentence.substring(idx - candidateIndexValue.intValue(), idx);
        if (token != "\'" && output.size() > 0) {
            String lastWord = output.get(output.size() - 1);
            if (lastWord.equalsIgnoreCase("\'s") ||
                    (Character.isDigit(partSentence.charAt(idx - 1)) && Character.isDigit(lastWord.charAt(0)))) {
                output.set(output.size() - 1, token + lastWord);
                newToken = false;
            }
        }
        if (newToken) {
            output.add(token);
        }
        idx -= candidateIndexValue.intValue();
    }
    return String.join(" ", Lists.reverse(output));
}

private Pair<Number, Number> bestMatch(String partSentence, List<Pair<Number, Number>> cost, int index,
                                       Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> candidates = Lists.reverse(cost.subList(Math.max(0, index - maxWordLength), index));
    int enumerateIdx = 0;
    Pair<Number, Number> minPair = new Pair<>(Integer.MAX_VALUE, Integer.valueOf(enumerateIdx));
    for (Pair<Number, Number> pair : candidates) {
        ++enumerateIdx;
        String subsequence = partSentence.substring(index - enumerateIdx, index).toLowerCase();
        Number minCost = Integer.MAX_VALUE;
        if (wordCost.containsKey(subsequence)) {
            minCost = pair.getKey().doubleValue() + wordCost.get(subsequence).doubleValue();
        }
        if (minCost.doubleValue() < minPair.getKey().doubleValue()) {
            minPair = new Pair<>(minCost.doubleValue(), enumerateIdx);
        }
    }
    return minPair;
}

It seems like fairly mundane backtracking will do. Start at the beginning of the string. Scan right until you have a word. Then, call the function on the rest of the string. The function returns "false" if it scans all the way to the right without recognizing a word. Otherwise, it returns the word it found and the list of words returned by the recursive call.
Example: "tableapple". Finds "tab", then "leap", but no word in "ple". No other word in "leapple". Finds "table", then "app". "le" not a word, so tries apple, recognizes, returns.
To get longest possible, keep going, only emitting (rather than returning) correct solutions; then, choose the optimal one by any criterion you choose (maxmax, minmax, average, etc.)
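A small sketch of that "keep going and emit every solution" variant (a generator with hypothetical names; the word set is only for the example):

def all_splits(s, words, start=0):
    # Emit every way to cover s[start:] with dictionary words (backtracking).
    if start == len(s):
        yield []
        return
    for end in range(start + 1, len(s) + 1):
        word = s[start:end]
        if word in words:
            for rest in all_splits(s, words, end):
                yield [word] + rest

words = {"tab", "table", "app", "apple", "le", "leap"}
# choose the "best" split by whatever criterion you like, e.g. fewest words:
print(min(all_splits("tableapple", words), key=len))
# ['table', 'apple']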

If you have an exhaustive list of the words contained within the string:
word_list = ["table", "apple", "chair", "cupboard"]
Use a list comprehension to iterate over the list, locating each word and counting how many times it appears.
string = "tableapplechairtablecupboard"
def split_string(string, word_list):
    return ("".join([(item + " ")*string.count(item.lower()) for item in word_list if item.lower() in string])).strip()
The function returns a string of the words in the order of the list: table table apple chair cupboard

Here is sample code (based on some of the examples above) in vanilla (pure) JavaScript. Make sure to add a word base (sample.txt) to use it:
async function getSampleText(data) {
    await fetch('sample.txt').then(response => response.text())
        .then(text => {
            const wordList = text;

            // Create a regular expression for splitting the input string.
            const splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");

            // Initialize the variables for storing the maximum word length and the word costs.
            let maxWordLen = 0;
            let wordCost = {};

            // Split the word list into an array of words.
            const words = wordList.split('\n');

            // Calculate the word costs based on the word list.
            words.forEach((word, index) => {
                wordCost[word] = Math.log((index + 1) * Math.log(words.length));
            });

            // Find the maximum word length.
            words.forEach((word) => {
                if (word.length > maxWordLen) {
                    maxWordLen = word.length;
                }
            });

            console.log(maxWordLen);
            //console.log(split(process.argv[2]));

            /**
             * Split the input string into an array of words.
             * @param {string} s The input string.
             * @return {Array} The array of words.
             */
            function split(s) {
                const list = [];
                s.split(splitRegex).forEach((sub) => {
                    _split(sub).forEach((word) => {
                        list.push(word);
                    });
                });
                return list;
            }

            /**
             * Split the input string into an array of words.
             * @private
             * @param {string} s The input string.
             * @return {Array} The array of words.
             */
            function _split(s) {
                const cost = [0];

                /**
                 * Find the best match for the i first characters, assuming cost has been built for the i-1 first characters.
                 * @param {number} i The index of the character to find the best match for.
                 * @return {Array} A pair containing the match cost and match length.
                 */
                function best_match(i) {
                    const candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
                    let minPair = [Number.MAX_SAFE_INTEGER, 0];
                    candidates.forEach((c, k) => {
                        let ccost;
                        if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
                            ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
                        } else {
                            ccost = Number.MAX_SAFE_INTEGER;
                        }
                        if (ccost < minPair[0]) {
                            minPair = [ccost, k + 1];
                        }
                    });
                    return minPair;
                }

                // Build the cost array.
                for (let i = 1; i < s.length + 1; i++) {
                    cost.push(best_match(i)[0]);
                }

                // Backtrack to recover the minimal-cost string.
                const out = [];
                let i = s.length;
                while (i > 0) {
                    const c = best_match(i)[0];
                    const k = best_match(i)[1];
                    if (c === cost[i]) {
                        console.log("Done: " + c);
                    }
                    let newToken = true;
                    if (s.slice(i - k, i) !== "'") {
                        if (out.length > 0) {
                            if (out[-1] === "'s" || (Number.isInteger(s[i - 1]) && Number.isInteger(out[-1][0]))) {
                                out[-1] = s.slice(i - k, i) + out[-1];
                                newToken = false;
                            }
                        }
                    }
                    if (newToken) {
                        out.push(s.slice(i - k, i));
                    }
                    i -= k;
                }
                return out.reverse();
            }

            console.log(split('Thiswasaveryniceday'));
        })
}

getSampleText();

You need to identify your vocabulary - perhaps any free word list will do.
Once done, use that vocabulary to build a suffix tree, and match your stream of input against that: http://en.wikipedia.org/wiki/Suffix_tree

Based on unutbu's solution I've implemented a Java version:
private static List<String> splitWordWithoutSpaces(String instring, String suffix) {
    if(isAWord(instring)) {
        if(suffix.length() > 0) {
            List<String> rest = splitWordWithoutSpaces(suffix, "");
            if(rest.size() > 0) {
                List<String> solutions = new LinkedList<>();
                solutions.add(instring);
                solutions.addAll(rest);
                return solutions;
            }
        } else {
            List<String> solutions = new LinkedList<>();
            solutions.add(instring);
            return solutions;
        }
    }
    if(instring.length() > 1) {
        String newString = instring.substring(0, instring.length()-1);
        suffix = instring.charAt(instring.length()-1) + suffix;
        List<String> rest = splitWordWithoutSpaces(newString, suffix);
        return rest;
    }
    return Collections.EMPTY_LIST;
}
Input: "tableapplechairtablecupboard"
Output: [table, apple, chair, table, cupboard]
Input: "tableprechaun"
Output: [tab, leprechaun]

For the German language there is CharSplit, which uses machine learning and works pretty well for strings of a few words.
https://github.com/dtuggener/CharSplit

Expanding on @miku's suggestion to use a trie, an append-only trie is relatively straightforward to implement in Python:
class Node:
    def __init__(self, is_word=False):
        self.children = {}
        self.is_word = is_word

class TrieDictionary:
    def __init__(self, words=tuple()):
        self.root = Node()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for c in word:
            node = node.children.setdefault(c, Node())
        node.is_word = True

    def lookup(self, word, from_node=None):
        node = self.root if from_node is None else from_node
        for c in word:
            try:
                node = node.children[c]
            except KeyError:
                return None
        return node
We can then build a Trie-based dictionary from a set of words:
dictionary = {"a", "pea", "nut", "peanut", "but", "butt", "butte", "butter"}
trie_dictionary = TrieDictionary(words=dictionary)
Which will produce a tree that looks like this (* indicates beginning or end of a word):
* -> a*
 \-> p -> e -> a*
 |              \-> n -> u -> t*
 \-> b -> u -> t*
 |              \-> t*
 |                   \-> e*
 |                        \-> r*
 \-> n -> u -> t*
We can incorporate this into a solution by combining it with a heuristic about how to choose words. For example we can prefer longer words over shorter words:
def using_trie_longest_word_heuristic(s):
    node = None
    possible_indexes = []

    # O(1) short-circuit if whole string is a word, doesn't go against longest-word wins
    if s in dictionary:
        return [ s ]

    for i in range(len(s)):
        # traverse the trie, char-wise to determine intermediate words
        node = trie_dictionary.lookup(s[i], from_node=node)

        # no more words start this way
        if node is None:
            # iterate words we have encountered from biggest to smallest
            for possible in possible_indexes[::-1]:
                # recurse to attempt to solve the remaining sub-string
                end_of_phrase = using_trie_longest_word_heuristic(s[possible+1:])

                # if we have a solution, return this word + our solution
                if end_of_phrase:
                    return [ s[:possible+1] ] + end_of_phrase

            # unsolvable
            break

        # if this is a leaf, append the index to the possible words list
        elif node.is_word:
            possible_indexes.append(i)

    # empty string OR unsolvable case
    return []
We can use this function like this:
>>> using_trie_longest_word_heuristic("peanutbutter")
[ "peanut", "butter" ]
Because we maintain our position in the Trie as we search for longer and longer words, we traverse the trie at most once per possible solution (rather than 2 times for peanut: pea, peanut). The final short-circuit saves us from walking char-wise through the string in the worst-case.
The final result is only a handful of inspections:
'peanutbutter' - not a word, go charwise
'p' - in trie, use this node
'e' - in trie, use this node
'a' - in trie and edge, store potential word and use this node
'n' - in trie, use this node
'u' - in trie, use this node
't' - in trie and edge, store potential word and use this node
'b' - not in trie from `peanut` vector
'butter' - remainder of longest is a word
A benefit of this solution is the fact that you know very quickly whether longer words exist with a given prefix, which spares the need to exhaustively test sequence combinations against a dictionary. It also makes getting to an unsolvable answer comparatively cheap compared to other implementations.
The downsides of this solution are a large memory footprint for the trie and the cost of building the trie up-front.

Related

Generating a string from substrings in a dictionary, whilst minimising certain characters being next to each other

I want to be able to generate a string from a dictionary containing substrings, whereby I input a string where each character corresponds to a key of the dictionary and it spits out a new string built from the values associated with those keys. However, I also want to minimise certain characters being next to each other.
For example:
dict = {'I': ['ATA', 'ATC', 'ATT'], 'M': ['ATG'], 'T': ['ACA', 'ACC', 'ACG', 'ACT'], 'N':['AAC', 'AAT'], 'K': ['AAA', 'AAG'], 'S': ['AGC', 'AGT'], 'R': ['AGA', 'AGG']}
input_str = "IIMTSTTKRI"
The output would be a string of the three-character substrings associated with each key. However, there are many 3-character substrings that could be used, and I would like to minimise the number of G's and C's that are next to one another.
I currently have this:
n = []
# make list of possible substrings for each character in string
for i in input_str:
    if i in dict.keys():
        n.append(dict[i])

# generate all permutations
p = [''.join(s) for s in itertools.product(*n)]

# if no consecutive GCs in a permutation, add to list
ls = []
for i in p:
    q = i.count('GC')
    if q == 0:
        ls.append(i)
Which 'works', but there are a couple of problems. The first (minor one) is that I have to assume the number of consecutive "GC"s is 0, and for some strings that may not be possible. The second (major one) is that it's extremely slow for longer strings because it has to generate all permutations.
Can anyone provide a way to improve the speed or an alternative way?
Based on your comments, you can look at the problem as an optimal path search (think of your problem as a graph where you must follow the path defined in input_str, and at each vertex you must choose from a list of defined 3-character strings).
There are many search algorithms; my solution uses A*:
from heapq import heappop, heappush

dct = {
    "I": ["ATA", "ATC", "ATT"],
    "M": ["ATG"],
    "T": ["ACA", "ACC", "ACG", "ACT"],
    "N": ["AAC", "AAT"],
    "K": ["AAA", "AAG"],
    "S": ["AGC", "AGT"],
    "R": ["AGA", "AGG"],
}

input_str = "IIMTSTTKRI"

def valid_moves(s):
    key = input_str[len(s) // 3]
    for i in dct[key]:
        yield s + i

def distance(s):
    return len(input_str) - (len(s) // 3)

def my_cost_func(_from, _to):
    return _to.count("GC")

def a_star(start, moves_func, h_func, cost_func):
    """
    Find a shortest sequence of states from start to a goal state
    (a state s with h_func(s) == 0).
    """
    frontier = [
        (h_func(start), start)
    ]  # A priority queue, ordered by path length, f = g + h
    previous = {
        start: None
    }  # start state has no previous state; other states will
    path_cost = {start: 0}  # The cost of the best path to a state.
    Path = lambda s: ([] if (s is None) else Path(previous[s]) + [s])
    while frontier:
        (f, s) = heappop(frontier)
        if h_func(s) == 0:
            return Path(s)
        for s2 in moves_func(s):
            g = path_cost[s] + cost_func(s, s2)
            if s2 not in path_cost or g < path_cost[s2]:
                heappush(frontier, (g + h_func(s2), s2))
                path_cost[s2] = g
                previous[s2] = s

path = a_star("", valid_moves, distance, my_cost_func)
print("Result:", path[-1])
This prints:
Result: ATAATAATGACAAGTACAACAAAAAGAATAAGT

Understanding another's text-mining function that removes similar strings

I’m trying to replicate the methodology from this article, 538 Post about Most Repetitive Phrases, in which the author mined US presidential debate transcripts to determine the most repetitive phrases for each candidate.
I'm trying to implement this methodology with another dataset in R with the tm package.
Most of the code (GitHub repository) concerns mining the transcripts and assembling counts of each ngram, but I get lost at the prune_substrings() function code below:
def prune_substrings(tfidf_dicts, prune_thru=1000):
    pruned = tfidf_dicts
    for candidate in range(len(candidates)):
        # growing list of n-grams in list form
        so_far = []
        ngrams_sorted = sorted(tfidf_dicts[candidate].items(), key=operator.itemgetter(1), reverse=True)[:prune_thru]
        for ngram in ngrams_sorted:
            # contained in a previous aka 'better' phrase
            for better_ngram in so_far:
                if overlap(list(better_ngram), list(ngram[0])):
                    #print "PRUNING!! "
                    #print list(better_ngram)
                    #print list(ngram[0])
                    pruned[candidate][ngram[0]] = 0
            # not contained, so add to so_far to prevent future subphrases
            else:
                so_far += [list(ngram[0])]
    return pruned
The input of the function, tfidf_dicts, is an array of dictionaries (one for each candidate) with ngrams as keys and tf-idf scores as values. For example, Trump's tf-idf dict begins like this:
trump.tfidf.dict = {'we don't win': 83.2, 'you have to': 72.8, ... }
so the structure of the input is like this:
tfidf_dicts = {trump.tfidf.dict, rubio.tfidf.dict, etc }
My understanding is that prune_substrings does the following things, but I'm stuck on the else clause, which is a pythonic thing I don't understand yet.
A. create list: pruned, as tfidf_dicts; a list of tf-idf dicts for each candidate
B. loop through each candidate:
       so_far = start an empty list of ngrams gone through so far
       ngrams_sorted = sorted member's tf-idf dict from smallest to biggest
       loop through each ngram in sorted
           loop through each better_ngram in so_far
               IF overlap between better_ngram (from so_far) and ngram (from ngrams_sorted) == TRUE:
                   THEN zero out tf-idf for ngram
               ELSE if (WHAT?!?)
                   add ngram to list so_far
C. return pruned, i.e. list of unique ngrams sorted in order
Any help at all is much appreciated!
Note the indentation in your code... The else is lined up with the second for, not the if. This is a for-else construct, not an if-else.
In that case, the else is being used to initialize the inner loop, because it will be executed when so_far is empty the first time through, and each time the inner loop runs out of items to iterate through...
I am not sure that this is the most efficient way to achieve these comparisons, but conceptually you can get a sense of the flow with this snippet:
s=[]
for j in "ABCD":
    for i in s:
        print i,
    else:
        print "\nelse"
    s.append(j)
Output:
else
A
else
A B
else
A B C
else
I would think that in R there is a much better way to do this than nested loops....
4 months later but here's my solution. I'm sure there is a more efficient solution, but for my purposes, it worked. The pythonic for-else doesn't translate to R. So the steps are different.
Take top n ngrams.
Create a list, t, where each element of the list is a logical vector of length n that says whether ngram in question overlaps all other ngrams (but fix 1:x to be false automatically)
Cbind together every element of t into a table, t2
Return only elements of t2 whose row sum is zero
set elements 1:n to FALSE (i.e. no overlap)
Voilà!
PrunedList Function
#' GetPrunedList
#'
#' takes a word freq df with columns Words and LenNorm, returns df of nonoverlapping strings
GetPrunedList <- function(wordfreqdf, prune_thru = 100) {
  # take only first n items in list
  tmp <- head(wordfreqdf, n = prune_thru) %>%
    select(ngrams = Words, tfidfXlength = LenNorm)

  # for each ngram in list:
  t <- (lapply(1:nrow(tmp), function(x) {
    # find overlap between ngram and all items in list (overlap = TRUE)
    idx <- overlap(tmp[x, "ngrams"], tmp$ngrams)
    # set overlap as false for itself and higher-scoring ngrams
    idx[1:x] <- FALSE
    idx
  }))

  # bind each ngram's overlap vector together to make a matrix
  t2 <- do.call(cbind, t)

  # find rows (i.e. ngrams) that do not overlap with those below
  idx <- rowSums(t2) == 0
  pruned <- tmp[idx, ]
  rownames(pruned) <- NULL
  pruned
}
Overlap function
#' overlap
#' OBJ: takes two ngrams (as strings) and to see if they overlap
#' INPUT: a, b ngrams as strings
#' OUTPUT: TRUE if overlap
overlap <- function(a, b) {
  max_overlap <- min(3, CountWords(a), CountWords(b))

  a.beg <- word(a, start = 1L, end = max_overlap)
  a.end <- word(a, start = -max_overlap, end = -1L)
  b.beg <- word(b, start = 1L, end = max_overlap)
  b.end <- word(b, start = -max_overlap, end = -1L)

  # b contains a's beginning
  w <- str_detect(b, coll(a.beg, TRUE))
  # b contains a's end
  x <- str_detect(b, coll(a.end, TRUE))
  # a contains b's beginning
  y <- str_detect(a, coll(b.beg, TRUE))
  # a contains b's end
  z <- str_detect(a, coll(b.end, TRUE))

  # return TRUE if any of above are true
  (w | x | y | z)
}

Finding the max depth of a set in a dictionary

I have a dictionary where the key is a string and the values of the key are a set of strings that also contain the key (word chaining). I'm having trouble finding the max depth of a graph, which would be the set with the most elements in the dictionary, and I'm trying to print out that max graph as well.
Right now my code prints:
{'DOG': [],
'HIPPOPOTIMUS': [],
'POT': ['SUPERPOT', 'HIPPOPOTIMUS'],
'SUPERPOT': []}
1
Where 1 is my maximum dictionary depth. I was expecting the depth to be two, but there appears to be only 1 layer to the graph of 'POT'
How can I find the maximum value set from the set of keys in a dictionary?
import pprint

def dict_depth(d, depth=0):
    if not isinstance(d, dict) or not d:
        return depth
    print max(dict_depth(v, depth+1) for k, v in d.iteritems())

def main():
    for keyCheck in wordDict:
        for keyCompare in wordDict:
            if keyCheck in keyCompare:
                if keyCheck != keyCompare:
                    wordDict[keyCheck].append(keyCompare)

if __name__ == "__main__":
    # load the words into a dictionary
    wordDict = dict((x.strip(), []) for x in open("testwordlist.txt"))
    main()
    pprint.pprint(wordDict)
    dict_depth(wordDict)
testwordlist.txt:
POT
SUPERPOT
HIPPOPOTIMUS
DOG
The "depth" of a dictionary will naturally be 1 plus the maximum depth of its entries. You've defined the depth of a non-dictionary to be zero. Since your top-level dictionary doesn't contain any dictionaries of its own, the depth of your dictionary is clearly 1. Your function reports that value correctly.
However, your function isn't written expecting the data format you're providing it. We can easily come up with inputs where the depth of substring chains is more than just one. For example:
DOG
DOGMA
DOGMATIC
DOGHOUSE
POT
Output of your current script:
{'DOG': ['DOGMATIC', 'DOGMA', 'DOGHOUSE'],
'DOGHOUSE': [],
'DOGMA': ['DOGMATIC'],
'DOGMATIC': [],
'POT': []}
1
I think you want to get 2 for that input because the longest substring chain is DOG → DOGMA → DOGMATIC, which contains two hops.
To get the depth of a dictionary as you've structured it, you want to calculate the chain length for each word. That's 1 plus the maximum chain length of each of its substrings, which gives us the following two functions:
def word_chain_length(d, w):
    if len(d[w]) == 0:
        return 0
    return 1 + max(word_chain_length(d, ww) for ww in d[w])

def dict_depth(d):
    print(max(word_chain_length(d, w) for w in d))
The word_chain_length function given here isn't particularly efficient. It may end up calculating the lengths of the same chain multiple times if a string is a substring of many words. Dynamic programming is a simple way to mitigate that, which I'll leave as an exercise.
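For reference, a minimal memoized variant (same dictionary structure as above; the cache maps each word to its chain length so every word is computed once):

def word_chain_length_memo(d, w, cache=None):
    # Same recursion as word_chain_length, but each word is solved only once.
    if cache is None:
        cache = {}
    if w not in cache:
        if not d[w]:
            cache[w] = 0
        else:
            cache[w] = 1 + max(word_chain_length_memo(d, ww, cache) for ww in d[w])
    return cache[w]

def dict_depth(d):
    cache = {}
    print(max(word_chain_length_memo(d, w, cache) for w in d))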
Sorry, my examples won't be in Python because my Python is rusty, but you should get the idea.
Let's say this is a binary tree (written in C++):
int depth(TreeNode* root){
    if(!root) return 0;
    return 1 + max(depth(root->left), depth(root->right));
}
Simple. Now let's expand this to more than just a left and right.
(golang code)
func depthfunc(dic Dic) int {
    if dic == nil {
        return 0
    }
    level := make([]int, 0)
    for _, anotherDic := range dic {
        depth := 1
        if inner, ok := anotherDic.(Dic); ok { // check if it goes down further
            depth = 1 + depthfunc(inner)
        }
        level = append(level, depth)
    }
    // find max
    max := 0
    for _, value := range level {
        if value > max {
            max = value
        }
    }
    return max
}
The idea is that you just go down each dictionary until there are no more dictionaries to go down, adding 1 for each level you traverse.
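The same idea in Python, for comparison (a sketch assuming plain nested dicts, not the answerer's code):

def depth(dic):
    # Descend into nested dicts, adding 1 for each level traversed.
    if not isinstance(dic, dict) or not dic:
        return 0
    return 1 + max(depth(value) for value in dic.values())

print(depth({'a': {'b': {'c': {}}}, 'd': {}}))   # 3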

Merge/Discard overlapping words

I want to merge strings (words) that are similar (string is within other string).
word
wor
words
wormhole
hole
Would make:
words
wormhole
As wor overlaps with: word, words, wormhole - wor is discarded;
word overlaps with: words - word is discarded;
hole overlaps with: wormhole - hole is discarded;
but words, wormhole don't overlap - so they stay.
How can I do this?
Edit
My solution is:
while read a
do
    grep $a FILE |
    awk 'length > m { m = length; a = $0 } END { print a }'
done < FILE |
sort -u
But I don't know if it wouldn't cause trouble with large datasets.
In Ruby:
list = %w[word wor words wormhole]
list.uniq
.tap{|a| a.reverse_each{|e| a.delete(e) if (a - [e]).any?{|x| x.include?(e)}}}
With a sufficiently long list of words, any nested loop over the words is going to be painfully slow. This is how I'd do it:
use strict;
use warnings;
use File::Slurp 'read_file';

chomp( my @words = read_file('/usr/share/dict/words') );

my %overlapped;
for my $word (@words) {
    $word =~ /(.*)(?{++$overlapped{$1}})(*FAIL)/;
    --$overlapped{$word};
}

print "$_\n" for grep ! $overlapped{$_}, @words;
It could perhaps be improved with Darshan Computing's suggestion of processing words longest to shortest.
You can use a hash to count the substrings of your list of words:
use strict;
use warnings;
use feature 'say';
my %seen; # seen substrings
my #words; # original list
while (<DATA>) { # read a new substring
chomp;
push #words, $_; # store the original
while (length) { # while a substring remains
$seen{$_}++; # increase its counter
chop; # shorten the substring
}
}
# All original words with count == 1 are the merged list
my #merged = grep $seen{$_} == 1, #words;
say for #merged;
__DATA__
w
word
wor
words
wormhole
hole
holes
Output:
words
wormhole
holes
Of course, you will need to compensate for case, punctuation and whitespace, as hash keys are exact, and the key Foo is different from the key foo.
It seems to me that sorting the words longest-to-shortest, we can then step through the sorted list only once, matching only against kept words. I'm poor at algorithmic analysis, but this makes sense to me and I think the performance would be good. It also seems to work, assuming the order of the kept words doesn't matter:
words = ['word', 'wor', 'words', 'wormhole', 'hole']
keepers = []
words.sort_by(&:length).reverse.each do |word|
  keepers.push(word) if ! keepers.any?{|keeper| keeper.include?(word)}
end
keepers
# => ["wormhole", "words"]
If the order of the kept words does matter, it would be pretty easy to modify this to account for that. One option would simply be:
words & keepers
# => ["words", "wormhole"]
amon's suggestion of...
Sort the list of all words in ascending order. If a word is a
substring of the next word, discard current word; move on otherwise.
...would require O(n log n) for the sort, and I'm not sure about the time complexity of Ashwini's solution, but it looks to be more than O(n log n).
I think this is an O(n) solution...
from collections import defaultdict

words = ['word', 'wor', 'words', 'wormhole']

infinite_defaultdict = lambda: defaultdict(infinite_defaultdict)
mydict = infinite_defaultdict()

for word in words:
    d = mydict
    for char in word:
        d = d[char]

result = []
for word in words:
    d = mydict
    for char in word:
        d = d[char]
    if not d:
        result.append(word)

print result
...which prints...
['words', 'wormhole']
Update
But I don't know if it would't cause troubles with large datasets.
For comparison, using 10,000 words from /usr/share/dict/words, this takes about 70 milliseconds of CPU time, whereas Ashwini's takes about 11 seconds.
Update 2
Okay. The original question read as if words could only overlap at the start, but if they can overlap anywhere, this code won't work. I think any algorithm which could do that would have a worst-case complexity of O(n²).
Use a list comprehension with any/all:
>>> lis = ['word','wor', 'words', 'wormhole']
#all
>>> [x for x in lis if all(x not in y for y in lis if y != x)]
['words', 'wormhole']
#any
>>> [x for x in lis if not any(x in y for y in lis if y != x)]
['words', 'wormhole']
You can also use marisa_trie here :
>>> import marisa_trie
>>> lis = ['word','wor', 'words', 'wormhole', 'hole', 'holes']
>>> def trie(lis):
...     trie = marisa_trie.Trie(lis)
...     return [x for x in lis if len(trie.keys(unicode(x))) == 1]
...
>>> trie(lis)
['words', 'wormhole', 'holes']
I understand your question as
Given a word list, we want to remove all those words that are substrings of other words.
Here is a general Perl solution:
sub weed_out {
    my @out;
    WORD:
    while (my $current = shift) {
        for (@_) {
            # skip $current word if it's a substring of any other word
            next WORD if -1 != index $_, $current;
        }
        push @out, $current;
    }
    return @out;
}
Note that we shift from the @_ argument array, thus the inner loop gets shorter each time.
If we encounter a word that is a substring of the $current word while doing the inner loop, we actually can remove it via splice:
WORD:
while (my $current = shift) {
    for (my $i = 0; ; $i++) {
        last unless $i <= $#_;  # loop condition must be here
        # remove the other word if it's a substring of $current
        splice(@_, $i, 1), redo if -1 != index $current, $_[$i];
        # skip $current word if it's a substring of any other word
        next WORD if -1 != index $_[$i], $current;
    }
    push @out, $current;
}
But I'd rather benchmark that “optimization”.
This can be easily embedded into a shell script if needed:
$ perl - <<'END' FILE
my @words = <>;
chomp(@words);
WORD: while (my $current = shift @words) {
    for (@words) {
        # skip $current word if it's a substring of any other word
        next WORD if -1 != index $_, $current;
    }
    print "$current\n";
}
END
Using awk:
awk '
NR==FNR {
    a[$1]++
    next
}
{
    for (x in a) {
        if (index($1, x) == 0) {
            a[x]
        }
        else {
            delete a[x]
            a[$1]
        }
    }
}
END {
    for (x in a) {
        print x
    }
}' inputFile inputFile
Test:
inputFile of:
word
wormholes
wor
words
wormhole
hole
Returns:
words
wormholes
Lengthy perl oneliner,
perl -nE 'chomp;($l,$p)=($_,0); @w=grep{ $p=1 if /$l/; $p|| $l!~/$_/} @w; $p or push @w,$l}{say for @w' file
a bash solution:
#!/bin/bash

dict="word wor words wormhole hole "
uniq=()

sort_by_length() {
    for word; do
        printf "%d %s\n" ${#word} "$word"
    done | sort -n | cut -d " " -f2-
}

set -- $(sort_by_length $dict)

while [[ $# -gt 0 ]]; do
    word=$1
    shift
    found=false
    for w; do
        if [[ $w == *"$word"* ]]; then
            found=true
            break
        fi
    done
    if ! $found; then
        uniq+=($word)
    fi
done

echo "${uniq[@]}"

How to split text without spaces into list of words

Input: "tableapplechairtablecupboard..." many words
What would be an efficient algorithm to split such text to the list of words and get:
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
First thing that cames to mind is to go through all possible words (starting with first letter) and find the longest word possible, continue from position=word_position+len(word)
P.S.
We have a list of all possible words.
Word "cupboard" can be "cup" and "board", select longest.
Language: python, but main thing is the algorithm itself.
A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-word text.
(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)
The idea
The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is the word with rank n in the list of words has probability roughly 1/(n log N) where N is the number of words in the dictionary.
Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it's easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.
The code
from math import log
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
"""Uses dynamic programming to infer the location of spaces in a string
without spaces."""
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
c,k = best_match(i)
cost.append(c)
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
return " ".join(reversed(out))
which you can use with
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
The results
I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.
Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.
Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen
odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho
rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery
whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.
After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.
Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.
After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.
As you can see it is essentially flawless. The most important part is to make sure your word list was trained to a corpus similar to what you will actually encounter, otherwise the results will be very bad.
Optimization
The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.
If you need to process a very large consecutive string it would be reasonable to split the string to avoid excessive memory usage. For example you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will have almost certainly no effect on the quality.
Based on the excellent work in the top answer, I've created a pip package for easy use.
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
To install, run pip install wordninja.
The only differences are minor. This returns a list rather than a str, it works in python3, it includes the word list and properly splits even if there are non-alpha chars (like underscores, dashes, etc).
Thanks again to Generic Human!
https://github.com/keredson/wordninja
Here is solution using recursive search:
def find_words(instring, prefix = '', words = None):
if not instring:
return []
if words is None:
words = set()
with open('/usr/share/dict/words') as f:
for line in f:
words.add(line.strip())
if (not prefix) and (instring in words):
return [instring]
prefix, suffix = prefix + instring[0], instring[1:]
solutions = []
# Case 1: prefix in solution
if prefix in words:
try:
solutions.append([prefix] + find_words(suffix, '', words))
except ValueError:
pass
# Case 2: prefix not in solution
try:
solutions.append(find_words(suffix, prefix, words))
except ValueError:
pass
if solutions:
return sorted(solutions,
key = lambda solution: [len(word) for word in solution],
reverse = True)[0]
else:
raise ValueError('no solution')
print(find_words('tableapplechairtablecupboard'))
print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))
yields
['table', 'apple', 'chair', 'table', 'cupboard']
['tab', 'leprechaun']
Using a trie data structure, which holds the list of possible words, it would not be too complicated to do the following:
Advance pointer (in the concatenated string)
Lookup and store the corresponding node in the trie
If the trie node has children (e.g. there are longer words), go to 1.
If the node reached has no children, a longest word match happened; add the word (stored in the node or just concatenated during trie traversal) to the result list, reset the pointer in the trie (or reset the reference), and start over
The answer by Generic Human is great. But the best implementation of this I've ever seen was written Peter Norvig himself in his book 'Beautiful Data'.
Before I paste his code, let me expand on why Norvig's method is more accurate (although a little slower and longer in terms of code).
The data is a bit better - both in terms of size and in terms of precision (he uses a word count rather than a simple ranking)
More importantly, it's the logic behind n-grams that really makes the approach so accurate.
The example he provides in his book is the problem of splitting a string 'sitdown'. Now a non-bigram method of string split would consider p('sit') * p ('down'), and if this less than the p('sitdown') - which will be the case quite often - it will NOT split it, but we'd want it to (most of the time).
However when you have the bigram model you could value p('sit down') as a bigram vs p('sitdown') and the former wins. Basically, if you don't use bigrams, it treats the probability of the words you're splitting as independent, which is not the case, some words are more likely to appear one after the other. Unfortunately those are also the words that are often stuck together in a lot of instances and confuses the splitter.
Here's the link to the data (it's data for 3 separate problems and segmentation is only one. Please read the chapter for details): http://norvig.com/ngrams/
and here's the link to the code: http://norvig.com/ngrams/ngrams.py
These links have been up a while, but I'll copy paste the segmentation part of the code here anyway
import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo
def segment2(text, prev='<S>'):
    "Return (log P(words), words), where words is the best segmentation."
    if not text: return 0.0, []
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first))
                  for first,rem in splits(text)]
    return max(candidates)

def combine(Pfirst, first, (Prem, rem)):
    "Combine first and rem results into one (probability, words) pair."
    return Pfirst+Prem, [first]+rem
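For reference, once count_1w.txt and count_2w.txt from the page above are in the working directory, the two segmenters would be called along these lines (Python 2, to match the code above):
print segment('sitdown')      # unigram model: returns a list of words
print segment2('sitdown')     # bigram model: returns (log10 probability, list of words)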
Unutbu's solution was quite close, but I find the code difficult to read and it didn't yield the expected result. Generic Human's solution has the drawback that it needs word frequencies, which is not appropriate for all use cases.
Here's a simple solution using a Divide and Conquer algorithm.
It tries to minimize the number of words. E.g. find_words('cupboard') will return ['cupboard'] rather than ['cup', 'board'] (assuming that cupboard, cup and board are in the dictionary).
The optimal solution is not unique; the implementation below returns one of them. find_words('charactersin') could return ['characters', 'in'] or maybe ['character', 'sin'] (as seen below). You could quite easily modify the algorithm to return all optimal solutions (a sketch of that modification follows the example run below).
In this implementation solutions are memoized so that it runs in a reasonable time.
The code:
words = set()
with open('/usr/share/dict/words') as f:
    for line in f:
        words.add(line.strip())

solutions = {}

def find_words(instring):
    # First check if instring is in the dictionary
    if instring in words:
        return [instring]
    # No... But maybe it's a result we already computed
    if instring in solutions:
        return solutions[instring]
    # Nope. Try every split position and recursively search for results
    best_solution = None
    for i in range(1, len(instring)):
        part1 = find_words(instring[:i])
        part2 = find_words(instring[i:])
        # Both parts MUST have a solution
        if part1 is None or part2 is None:
            continue
        solution = part1 + part2
        # Is the solution found "better" than the previous one?
        if best_solution is None or len(solution) < len(best_solution):
            best_solution = solution
    # Remember (memoize) this solution to avoid having to recompute it
    solutions[instring] = best_solution
    return best_solution
This will take about 5 seconds on my 3GHz machine:
result = find_words("thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot")
assert(result is not None)
print(' '.join(result))
the reis masses of text information of peoples comments which is parsed from h t m l but there are no delimited character sin them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple e t c in the string i also have a large dictionary to query whether the word is reasonable so whats the fastest way of extraction t h x a lot
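As mentioned above, the algorithm can be modified to return all optimal (minimal-length) splits. A sketch of that modification, reusing the same words set but with its own memo dict (find_all_words is a hypothetical name, not part of the original answer):
all_solutions = {}

def find_all_words(instring):
    """Variant of find_words returning every minimal-length segmentation of instring."""
    if instring in words:
        return [[instring]]                     # a single word is always the unique minimum
    if instring in all_solutions:
        return all_solutions[instring]
    candidates = []
    for i in range(1, len(instring)):
        for p1 in find_all_words(instring[:i]):
            for p2 in find_all_words(instring[i:]):
                candidates.append(p1 + p2)
    if candidates:
        best = min(len(c) for c in candidates)
        seen, kept = set(), []
        for c in candidates:
            if len(c) == best and tuple(c) not in seen:
                seen.add(tuple(c))
                kept.append(c)
        candidates = kept
    all_solutions[instring] = candidates
    return candidates

# e.g. find_all_words('charactersin') would return both ['characters', 'in']
# and ['character', 'sin'] if both are two-word splits in the dictionary.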
Here is the accepted answer translated to JavaScript (requires node.js, and the file "wordninja_words.txt" from https://github.com/keredson/wordninja):
var fs = require("fs");

var splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
var maxWordLen = 0;
var wordCost = {};

fs.readFile("./wordninja_words.txt", 'utf8', function(err, data) {
    if (err) {
        throw err;
    }
    var words = data.split('\n');
    // Zipf-style cost: cost = log((rank + 1) * log(dictionary size))
    words.forEach(function(word, index) {
        wordCost[word] = Math.log((index + 1) * Math.log(words.length));
    });
    words.forEach(function(word) {
        if (word.length > maxWordLen)
            maxWordLen = word.length;
    });
    console.log(maxWordLen);
    splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
    console.log(split(process.argv[2]));
});

function split(s) {
    var list = [];
    s.split(splitRegex).forEach(function(sub) {
        _split(sub).forEach(function(word) {
            list.push(word);
        });
    });
    return list;
}

module.exports = split;

function _split(s) {
    var cost = [0];

    // Find the best match for the first i characters, assuming cost has been
    // built for the first i-1 characters. Returns [matchCost, matchLength].
    function best_match(i) {
        var candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
        var minPair = [Number.MAX_SAFE_INTEGER, 0];
        candidates.forEach(function(c, k) {
            var ccost;
            if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
                ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
            } else {
                ccost = Number.MAX_SAFE_INTEGER;
            }
            if (ccost < minPair[0]) {
                minPair = [ccost, k + 1];
            }
        });
        return minPair;
    }

    // Build the cost array.
    for (var i = 1; i < s.length + 1; i++) {
        cost.push(best_match(i)[0]);
    }

    // Backtrack to recover the minimal-cost split.
    var out = [];
    i = s.length;
    while (i > 0) {
        var c = best_match(i)[0];
        var k = best_match(i)[1];
        if (c !== cost[i])
            console.log("Alert: mismatched cost " + c);  // sanity check, mirrors the Python assert
        var newToken = true;
        if (s.slice(i - k, i) != "'") {
            if (out.length > 0) {
                var last = out[out.length - 1];
                // Re-attach "'s" and digit runs to the previous token.
                if (last == "'s" || (/\d/.test(s[i - 1]) && /\d/.test(last[0]))) {
                    out[out.length - 1] = s.slice(i - k, i) + last;
                    newToken = false;
                }
            }
        }
        if (newToken) {
            out.push(s.slice(i - k, i));
        }
        i -= k;
    }
    return out.reverse();
}
If you precompile the wordlist into a DFA (which will be very slow), then the time it takes to match an input will be proportional to the length of the string (in fact, only a little slower than just iterating over the string).
This is effectively a more general version of the trie algorithm which was mentioned earlier. I only mention it for completeness -- as of yet, there's no DFA implementation you can just use. RE2 would work, but I don't know if the Python bindings let you tune how large you allow a DFA to be before it just throws away the compiled DFA data and does NFA search.
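For completeness, here is a minimal sketch of the underlying idea (my own illustration, not an existing DFA library): the finite word list is compiled once into a transition table, which is essentially the trie mentioned earlier in tabular form, and can then be scanned without any further dictionary lookups; real DFA minimisation is not shown:
def build_automaton(words):
    """Encode the word list as a transition table: state -> {char: state},
    plus the set of accepting states. This is the trie in tabular form;
    a real DFA compiler would additionally merge equivalent states."""
    transitions = [{}]      # state 0 is the start state
    accepting = set()
    for word in words:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                transitions[state][ch] = len(transitions)
                transitions.append({})
            state = transitions[state][ch]
        accepting.add(state)
    return transitions, accepting

def scan_words(s, transitions, accepting):
    """Yield (start, end) spans of s that spell a dictionary word,
    using one pass over the automaton per start position."""
    for start in range(len(s)):
        state = 0
        for end in range(start, len(s)):
            state = transitions[state].get(s[end])
            if state is None:
                break
            if state in accepting:
                yield (start, end + 1)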
This will help
from wordsegment import load, segment
load()
segment('providesfortheresponsibilitiesofperson')
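In case it helps, the package is presumably the wordsegment module on PyPI, so a complete snippet would look roughly like this (output not verified here):
# pip install wordsegment   (assumed package name)
from wordsegment import load, segment

load()    # initialises the word-frequency data used by segment()
print(segment('providesfortheresponsibilitiesofperson'))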
Many thanks for the help in https://github.com/keredson/wordninja/.
Here is a small contribution of the same in Java from my side.
The public method splitContiguousWords can be embedded with the other two methods in a class that has ninja_words.txt in the same directory (or modify the path as you prefer). The method splitContiguousWords can then be used for the purpose.
public List<String> splitContiguousWords(String sentence) {

    String splitRegex = "[^a-zA-Z0-9']+";
    Map<String, Number> wordCost = new HashMap<>();
    List<String> dictionaryWords = IOUtils.linesFromFile("ninja_words.txt", StandardCharsets.UTF_8.name());
    double naturalLogDictionaryWordsCount = Math.log(dictionaryWords.size());
    long wordIdx = 0;
    for (String word : dictionaryWords) {
        wordCost.put(word, Math.log(++wordIdx * naturalLogDictionaryWordsCount));
    }
    int maxWordLength = Collections.max(dictionaryWords, Comparator.comparing(String::length)).length();
    List<String> splitWords = new ArrayList<>();
    for (String partSentence : sentence.split(splitRegex)) {
        splitWords.add(split(partSentence, wordCost, maxWordLength));
    }
    log.info("Split word for the sentence: {}", splitWords);
    return splitWords;
}

private String split(String partSentence, Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> cost = new ArrayList<>();
    cost.add(new Pair<>(Integer.valueOf(0), Integer.valueOf(0)));
    for (int index = 1; index < partSentence.length() + 1; index++) {
        cost.add(bestMatch(partSentence, cost, index, wordCost, maxWordLength));
    }
    int idx = partSentence.length();
    List<String> output = new ArrayList<>();
    while (idx > 0) {
        Pair<Number, Number> candidate = bestMatch(partSentence, cost, idx, wordCost, maxWordLength);
        Number candidateCost = candidate.getKey();
        Number candidateIndexValue = candidate.getValue();
        if (candidateCost.doubleValue() != cost.get(idx).getKey().doubleValue()) {
            throw new RuntimeException("Candidate cost unmatched; This should not be the case!");
        }
        boolean newToken = true;
        String token = partSentence.substring(idx - candidateIndexValue.intValue(), idx);
        // Re-attach "'s" and digit runs to the previous token.
        if (!token.equals("'") && output.size() > 0) {
            String lastWord = output.get(output.size() - 1);
            if (lastWord.equalsIgnoreCase("'s") ||
                    (Character.isDigit(partSentence.charAt(idx - 1)) && Character.isDigit(lastWord.charAt(0)))) {
                output.set(output.size() - 1, token + lastWord);
                newToken = false;
            }
        }
        if (newToken) {
            output.add(token);
        }
        idx -= candidateIndexValue.intValue();
    }
    return String.join(" ", Lists.reverse(output));
}

private Pair<Number, Number> bestMatch(String partSentence, List<Pair<Number, Number>> cost, int index,
                                       Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> candidates = Lists.reverse(cost.subList(Math.max(0, index - maxWordLength), index));
    int enumerateIdx = 0;
    Pair<Number, Number> minPair = new Pair<>(Integer.MAX_VALUE, Integer.valueOf(enumerateIdx));
    for (Pair<Number, Number> pair : candidates) {
        ++enumerateIdx;
        String subsequence = partSentence.substring(index - enumerateIdx, index).toLowerCase();
        Number minCost = Integer.MAX_VALUE;
        if (wordCost.containsKey(subsequence)) {
            minCost = pair.getKey().doubleValue() + wordCost.get(subsequence).doubleValue();
        }
        if (minCost.doubleValue() < minPair.getKey().doubleValue()) {
            minPair = new Pair<>(minCost.doubleValue(), enumerateIdx);
        }
    }
    return minPair;
}
It seems like fairly mundane backtracking will do. Start at the beginning of the string. Scan right until you have a word. Then, call the function on the rest of the string. The function returns "false" if it scans all the way to the right without recognizing a word. Otherwise, it returns the word it found and the list of words returned by the recursive call.
Example: "tableapple". Finds "tab", then "leap", but no word in "ple". No other word in "leapple". Finds "table", then "app". "le" is not a word, so it tries "apple", recognizes it, and returns.
To get the longest possible split, keep going, only emitting (rather than returning) correct solutions; then choose the optimal one by any criterion you like (maxmax, minmax, average, etc.).
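A rough sketch of that backtracking search in Python, assuming words is a set of valid words (backtrack_split is a hypothetical name):
def backtrack_split(s, words):
    """Return a list of words covering s, or None if s cannot be segmented.
    Scans right until a known word is found, recurses on the rest, and
    backtracks (tries a longer prefix) when the rest cannot be segmented."""
    if not s:
        return []
    for n in range(1, len(s) + 1):          # shortest prefix first, as described above
        prefix = s[:n]
        if prefix in words:
            rest = backtrack_split(s[n:], words)
            if rest is not None:
                return [prefix] + rest
    return None

print(backtrack_split("tableapple", {"tab", "leap", "table", "app", "apple"}))
# -> ['table', 'apple']  (the 'tab'/'leap' branch dead-ends on 'ple' and is abandoned)
Emitting every correct solution instead of returning the first one is then a matter of turning the function into a generator and yielding each successful combination.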
If you have an exhaustive list of the words contained within the string:
word_list = ["table", "apple", "chair", "cupboard"]
Use a list comprehension to iterate over the word list, locating each word and counting how many times it appears.
string = "tableapplechairtablecupboard"
def split_string(string, word_list):
return ("".join([(item + " ")*string.count(item.lower()) for item in word_list if item.lower() in string])).strip()
The function returns a string of the words in the order of the list: table table apple chair cupboard
Here is some sample code (based on the examples above) in vanilla (pure) JavaScript. Make sure to provide a word base (sample.txt) to use it:
async function getSampleText(data) {
    await fetch('sample.txt').then(response => response.text())
        .then(text => {
            const wordList = text;

            // Create a regular expression for splitting the input string.
            const splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");

            // Initialize the variables for storing the maximum word length and the word costs.
            let maxWordLen = 0;
            let wordCost = {};

            // Split the word list into an array of words.
            const words = wordList.split('\n');

            // Calculate the word costs based on the word list.
            words.forEach((word, index) => {
                wordCost[word] = Math.log((index + 1) * Math.log(words.length));
            });

            // Find the maximum word length.
            words.forEach((word) => {
                if (word.length > maxWordLen) {
                    maxWordLen = word.length;
                }
            });

            console.log(maxWordLen);
            //console.log(split(process.argv[2]));

            /**
             * Split the input string into an array of words.
             * @param {string} s The input string.
             * @return {Array} The array of words.
             */
            function split(s) {
                const list = [];
                s.split(splitRegex).forEach((sub) => {
                    _split(sub).forEach((word) => {
                        list.push(word);
                    });
                });
                return list;
            }

            /**
             * Split the input string into an array of words.
             * @private
             * @param {string} s The input string.
             * @return {Array} The array of words.
             */
            function _split(s) {
                const cost = [0];

                /**
                 * Find the best match for the i first characters, assuming cost has been built for the i-1 first characters.
                 * @param {number} i The index of the character to find the best match for.
                 * @return {Array} A pair containing the match cost and match length.
                 */
                function best_match(i) {
                    const candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
                    let minPair = [Number.MAX_SAFE_INTEGER, 0];
                    candidates.forEach((c, k) => {
                        let ccost;
                        if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
                            ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
                        } else {
                            ccost = Number.MAX_SAFE_INTEGER;
                        }
                        if (ccost < minPair[0]) {
                            minPair = [ccost, k + 1];
                        }
                    });
                    return minPair;
                }

                // Build the cost array.
                for (let i = 1; i < s.length + 1; i++) {
                    cost.push(best_match(i)[0]);
                }

                // Backtrack to recover the minimal-cost string.
                const out = [];
                let i = s.length;
                while (i > 0) {
                    const c = best_match(i)[0];
                    const k = best_match(i)[1];
                    if (c === cost[i]) {
                        console.log("Done: " + c);
                    }
                    let newToken = true;
                    if (s.slice(i - k, i) !== "'") {
                        if (out.length > 0) {
                            const last = out[out.length - 1];
                            // Re-attach "'s" and digit runs to the previous token.
                            if (last === "'s" || (/\d/.test(s[i - 1]) && /\d/.test(last[0]))) {
                                out[out.length - 1] = s.slice(i - k, i) + last;
                                newToken = false;
                            }
                        }
                    }
                    if (newToken) {
                        out.push(s.slice(i - k, i));
                    }
                    i -= k;
                }
                return out.reverse();
            }

            console.log(split('Thiswasaveryniceday'));
        })
}

getSampleText();
You need to identify your vocabulary - perhaps any free word list will do.
Once done, use that vocabulary to build a suffix tree, and match your stream of input against that: http://en.wikipedia.org/wiki/Suffix_tree
Based on unutbu's solution I've implemented a Java version:
private static List<String> splitWordWithoutSpaces(String instring, String suffix) {
    if (isAWord(instring)) {
        if (suffix.length() > 0) {
            List<String> rest = splitWordWithoutSpaces(suffix, "");
            if (rest.size() > 0) {
                List<String> solutions = new LinkedList<>();
                solutions.add(instring);
                solutions.addAll(rest);
                return solutions;
            }
        } else {
            List<String> solutions = new LinkedList<>();
            solutions.add(instring);
            return solutions;
        }
    }
    if (instring.length() > 1) {
        String newString = instring.substring(0, instring.length() - 1);
        suffix = instring.charAt(instring.length() - 1) + suffix;
        List<String> rest = splitWordWithoutSpaces(newString, suffix);
        return rest;
    }
    return Collections.EMPTY_LIST;
}
Input: "tableapplechairtablecupboard"
Output: [table, apple, chair, table, cupboard]
Input: "tableprechaun"
Output: [tab, leprechaun]
For the German language there is CharSplit, which uses machine learning and works pretty well for strings of a few words.
https://github.com/dtuggener/CharSplit
Expanding on @miku's suggestion to use a Trie, an append-only Trie is relatively straightforward to implement in Python:
class Node:
    def __init__(self, is_word=False):
        self.children = {}
        self.is_word = is_word

class TrieDictionary:
    def __init__(self, words=tuple()):
        self.root = Node()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for c in word:
            node = node.children.setdefault(c, Node())
        node.is_word = True

    def lookup(self, word, from_node=None):
        node = self.root if from_node is None else from_node
        for c in word:
            try:
                node = node.children[c]
            except KeyError:
                return None
        return node
We can then build a Trie-based dictionary from a set of words:
dictionary = {"a", "pea", "nut", "peanut", "but", "butt", "butte", "butter"}
trie_dictionary = TrieDictionary(words=dictionary)
Which will produce a tree that looks like this (* indicates beginning or end of a word):
* -> a*
  \-> p -> e -> a*
  |              \-> n -> u -> t*
  \-> b -> u -> t*
  |             \-> t*
  |                  \-> e*
  |                       \-> r*
  \-> n -> u -> t*
We can incorporate this into a solution by combining it with a heuristic about how to choose words. For example we can prefer longer words over shorter words:
def using_trie_longest_word_heuristic(s):
    node = None
    possible_indexes = []

    # O(1) short-circuit if whole string is a word, doesn't go against longest-word wins
    if s in dictionary:
        return [s]

    for i in range(len(s)):
        # traverse the trie, char-wise to determine intermediate words
        node = trie_dictionary.lookup(s[i], from_node=node)

        # no more words start this way
        if node is None:
            # iterate words we have encountered from biggest to smallest
            for possible in possible_indexes[::-1]:
                # recurse to attempt to solve the remaining sub-string
                end_of_phrase = using_trie_longest_word_heuristic(s[possible + 1:])

                # if we have a solution, return this word + our solution
                if end_of_phrase:
                    return [s[:possible + 1]] + end_of_phrase

            # unsolvable
            break

        # if this node ends a word, remember the index as a possible split point
        elif node.is_word:
            possible_indexes.append(i)

    # empty string OR unsolvable case
    return []
We can use this function like this:
>>> using_trie_longest_word_heuristic("peanutbutter")
[ "peanut", "butter" ]
Because we maintain our position in the Trie as we search for longer and longer words, we traverse the trie at most once per possible solution (rather than 2 times for peanut: pea, peanut). The final short-circuit saves us from walking char-wise through the string in the worst-case.
The final result is only a handful of inspections:
'peanutbutter' - not a word, go charwise
'p' - in trie, use this node
'e' - in trie, use this node
'a' - in trie and edge, store potential word and use this node
'n' - in trie, use this node
'u' - in trie, use this node
't' - in trie and edge, store potential word and use this node
'b' - not in trie from `peanut` vector
'butter' - remainder of longest is a word
A benefit of this solution is that you know very quickly whether longer words exist with a given prefix, which spares the need to exhaustively test sequence combinations against a dictionary. It also makes arriving at an "unsolvable" answer comparatively cheap compared to other implementations.
The downsides of this solution are a large memory footprint for the trie and the cost of building the trie up-front.
