I need to loop trough n lines of a file and for any i between 1 and n-1 to get the difference between words of line(n-1) - line(n) (eg. line[i]word[j] - line[i+1]word[j] etc .. )
Input :
Hey there !
Hey thre !
What a suprise.
What a uprise.
I don't know what to do.
I don't know wt to do.
Output:
e
s
ha
The goal is to extract the missing character(s) between two consecutive line words only.
I'm new to python so if you can guide me through writing the code, I would be more than thankful.
Without any lib :
def extract_missing_chars(s1, s2):
if len(s1) < len(s2):
return extract_missing_chars(s2, s1)
i = 0
to_return = []
for c in s1:
if s2[i] != c:
to_return.append(c)
else:
i += 1
return to_return
f = open('testfile')
l1 = f.readline()
while l1:
l2 = f.readline()
print(''.join(extract_missing_chars(l1, l2)))
l1 = f.readline()
Your example indicates that you want the comparisons between pairs of lines. This is different from defining it as line(n-1)-line(n) which would give you 5 results, not 3.
The result also depends on what you consider to be differences. Is it positional, is it simply based on missing letters from the odd lines or are the differences applicable in both directions.
(e.g. "boat"-"tub" = "boat", "oa" or "oau" ?).
You also have to decide if you want the differences to be case sensitive or not.
Here's an example where computation of the differences is centralized in a function so that you can change the rules more easily. It assumes that "boat"-"tub" = "oau".
lines = """Hey there !
Hey thre !
What a suprise.
What a uprise.
I don't know what to do.
I don't know wt to do.
""".split('\n')
def differences(word1,word2):
if isinstance(word1,list):
return "".join( differences(w1,w2) for w1,w2 in zip(word1+[""]*len(word2),word2+[""]*len(word1)) )
return "".join( c*abs(word1.count(c)-word2.count(c)) for c in set(word1+word2) )
result = [ differences(line1.split(),line2.split()) for line1,line2 in zip(lines[::2],lines[1::2]) ]
# ['e', 's', 'ha']
Note that line processing for result is based on your example (not on your definition).
Related
I am looking to be able to recursively remove adjacent letters in a string that differ only in their case e.g. if s = AaBbccDd i would want to be able to remove Aa Bb Dd but leave cc.
I can do this recursively using lists:
I think it aught to be able to be done using regex but i am struggling:
with test string 'fffAaaABbe' the answer should be 'fffe' but the regex I am using gives 'fe'
def test(line):
res = re.compile(r'(.)\1{1}', re.IGNORECASE)
#print(res.search(line))
while res.search(line):
line = res.sub('', line, 1)
print(line)
The way that works is:
def test(line):
result =''
chr = list(line)
cnt = 0
i = len(chr) - 1
while i > 0:
if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
cnt += 1
chr.pop(i)
chr.pop(i - 1)
i -= 2
else:
i -= 1
if cnt > 0: # until we can't find any duplicates.
return test(''.join(chr))
result = ''.join(chr)
print(result)
Is it possible to do this using a regex?
re.IGNORECASE is not way to solve this problem, as it will treat aa, Aa, aA, AA same way. Technically it is possible using re.sub, following way.
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly defined all possible letters pairs, because so far I know there is no way to say "inverted case letter" using just re pattern. Maybe other user will be able to provide more concise re-based solution.
I suggest a different approach which uses groupby to group adjacent similar letters:
from itertools import groupby
def test(line):
res = []
for k, g in groupby(line, key=lambda x: x.lower()):
g = list(g)
if all(x == x.lower() for x in g):
res.append(''.join(g))
print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe
r'(.)\1{1}' is wrong because it will match any character that is repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.
However, even if we just do r'[A-z]\1{1}', this would still be bad because you would match any sequence of the same letter twice, but it would catch xx and XX -- you don't want to match consecutive same characters with matching case, as you said in the original question.
It just so happens that there is no short-hand to do this conveniently, but it is still possible. You could also just write a small function to turn it into a short-hand.
Building on #Daweo's answer, you can generate the regex pattern needed to match pairs of same letters with non-matching case to get the final pattern of aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:
import re
import string
def consecutiveLettersNonMatchingCase():
# Get all 'xX|Xx' with a list comprehension
# and join them with '|'
return '|'.join(['{0}{1}|{1}{0}'.format(s, t)\
# Iterate through the upper/lowercase characters
# in lock-step
for s, t in zip(
string.ascii_lowercase,
string.ascii_uppercase)])
def test(line):
res = re.compile(consecutiveLettersNonMatchingCase())
print(res.search(line))
while res.search(line):
line = res.sub('', line, 1)
print(line)
print(consecutiveLettersNonMatchingCase())
I'm trying to write a program that mimics a word game where, from a given set of words, it will find the longest possible sequence of words. No word can be used twice.
I can do the matching letters and words up, and storing them into lists, but I'm having trouble getting my head around how to handle the potentially exponential number of possibilities of words in lists. If word 1 matches word 2 and then I go down that route, how do I then back up to see if words 3 or 4 match up with word one and then start their own routes, all stemming from the first word?
I was thinking some way of calling the function inside itself maybe?
I know it's nowhere near doing what I need it to do, but it's a start. Thanks in advance for any help!
g = "audino bagon baltoy banette bidoof braviary bronzor carracosta charmeleon cresselia croagunk darmanitan deino emboar emolga exeggcute gabite girafarig gulpin haxorus"
def pokemon():
count = 1
names = g.split()
first = names[count]
master = []
for i in names:
print (i, first, i[0], first[-1])
if i[0] == first[-1] and i not in master:
master.append(i)
count += 1
first = i
print ("success", master)
if len(master) == 0:
return "Pokemon", first, "does not work"
count += 1
first = names[count]
pokemon()
Your idea of calling a function inside of itself is a good one. We can solve this with recursion:
def get_neighbors(word, choices):
return set(x for x in choices if x[0] == word[-1])
def longest_path_from(word, choices):
choices = choices - set([word])
neighbors = get_neighbors(word, choices)
if neighbors:
paths = (longest_path_from(w, choices) for w in neighbors)
max_path = max(paths, key=len)
else:
max_path = []
return [word] + max_path
def longest_path(choices):
return max((longest_path_from(w, choices) for w in choices), key=len)
Now we just define our word list:
words = ("audino bagon baltoy banette bidoof braviary bronzor carracosta "
"charmeleon cresselia croagunk darmanitan deino emboar emolga "
"exeggcute gabite girafarig gulpin haxorus")
words = frozenset(words.split())
Call longest_path with a set of words:
>>> longest_path(words)
['girafarig', 'gabite', 'exeggcute', 'emolga', 'audino']
A couple of things to know: as you point out, this has exponential complexity, so beware! Also, know that python has a recursion limit!
Using some black magic and graph theory I found a partial solution that might be good (not thoroughly tested).
The idea is to map your problem into a graph problem rather than a simple iterative problem (although it might work too!). So I defined the nodes of the graph to be the first letters and last letters of your words. I can only create edges between nodes of type first and last. I cannot map node first number X to node last number X (a word cannot be followed by it self). And from that your problem is just the same as the Longest path problem which tends to be NP-hard for general case :)
By taking some information here: stackoverflow-17985202 I managed to write this:
g = "audino bagon baltoy banette bidoof braviary bronzor carracosta charmeleon cresselia croagunk darmanitan deino emboar emolga exeggcute gabite girafarig gulpin haxorus"
words = g.split()
begin = [w[0] for w in words] # Nodes first
end = [w[-1] for w in words] # Nodes last
links = []
for i, l in enumerate(end): # Construct edges
ok = True
offset = 0
while ok:
try:
bl = begin.index(l, offset)
if i != bl: # Cannot map to self
links.append((i, bl))
offset = bl + 1 # next possible edge
except ValueError: # no more possible edge for this last node, Next!
ok = False
# Great function shamelessly taken from stackoverflow (link provided above)
import networkx as nx
def longest_path(G):
dist = {} # stores [node, distance] pair
for node in nx.topological_sort(G):
# pairs of dist,node for all incoming edges
pairs = [(dist[v][0]+1,v) for v in G.pred[node]]
if pairs:
dist[node] = max(pairs)
else:
dist[node] = (0, node)
node,(length,_) = max(dist.items(), key=lambda x:x[1])
path = []
while length > 0:
path.append(node)
length,node = dist[node]
return list(reversed(path))
# Construct graph
G = nx.DiGraph()
G.add_edges_from(links)
# TADAAAA!
print(longest_path(G))
Although it looks nice, there is a big drawback. You example works because there is no cycle in the resulting graph of input words, however, this solution fails on cyclic graphs.
A way around that is to detect cycles and break them. Detection can be done this way:
if nx.recursive_simple_cycles(G):
print("CYCLES!!! /o\")
Breaking the cycle can be done by just dropping a random edge in the cycle and then you will randomly find the optimal solution for your problem (imagine a cycle with a tail, you should cut the cycle on the node having 3 edges), thus I suggest brute-forcing this part by trying all possible cycle breaks, computing longest path and taking the longest of the longest path. If you have multiple cycles it becomes a bit more explosive in number of possibilities... but hey it's NP-hard, at least the way I see it and I didn't plan to solve that now :)
Hope it helps
Here's a solution that doesn't require recursion. It uses the itertools permutation function to look at all possible orderings of the words, and find the one with the longest length. To save time, as soon as an ordering hits a word that doesn't work, it stops checking that ordering and moves on.
>>> g = 'girafarig eudino exeggcute omolga gabite'
... p = itertools.permutations(g.split())
... longestword = ""
... for words in p:
... thistry = words[0]
... # Concatenates words until the next word doesn't link with this one.
... for i in range(len(words) - 1):
... if words[i][-1] != words[i+1][0]:
... break
... thistry += words[i+1]
... i += 1
... if len(thistry) > len(longestword):
... longestword = thistry
... print(longestword)
... print("Final answer is {}".format(longestword))
girafarig
girafariggabiteeudino
girafariggabiteeudinoomolga
girafariggabiteexeggcuteeudinoomolga
Final answer is girafariggabiteexeggcuteeudinoomolga
First, let's see what the problem looks like:
from collections import defaultdict
import pydot
words = (
"audino bagon baltoy banette bidoof braviary bronzor carracosta "
"charmeleon cresselia croagunk darmanitan deino emboar emolga "
"exeggcute gabite girafarig gulpin haxorus"
).split()
def main():
# get first -> last letter transitions
nodes = set()
arcs = defaultdict(lambda: defaultdict(list))
for word in words:
first = word[0]
last = word[-1]
nodes.add(first)
nodes.add(last)
arcs[first][last].append(word)
# create a graph
graph = pydot.Dot("Word_combinations", graph_type="digraph")
# use letters as nodes
for node in sorted(nodes):
n = pydot.Node(node, shape="circle")
graph.add_node(n)
# use first-last as directed edges
for first, sub in arcs.items():
for last, wordlist in sub.items():
count = len(wordlist)
label = str(count) if count > 1 else ""
e = pydot.Edge(first, last, label=label)
graph.add_edge(e)
# save result
graph.write_jpg("g:/temp/wordgraph.png", prog="dot")
if __name__=="__main__":
main()
results in
which makes the solution fairly obvious (path shown in red), but only because the graph is acyclic (with the exception of two trivial self-loops).
I'm not a python expert, and I ran into this snippet of code which actually works and produces the correct answer, but I'm not sure I understand what happens in the second line:
for i in range(len(motifs[0])):
best = ''.join([motifs[j][i] for j in range(len(motifs))])
profile.append([(best.count(base)+1)/float(len(best)) for base in 'ACGT'])
I was trying to replace it with something like:
for i in range(len(motifs[0])):
for j in range(len(motifs)):
best =[motifs[j][i]]
profile.append([(best.count(base)+1)/float(len(best)) for base in 'ACGT'])
and also tried to break down the last line like this:
for i in range(len(motifs[0])):
for j in range(len(motifs)):
best =[motifs[j][i]]
for base in 'ACGT':
profile.append(best.count(base)+1)/float(len(best)
I tried some more variations but non of them worked.
My question is: What are those expressions (second and third line of first code) mean and how would you break it down to a few lines?
Thanks :)
''.join([motifs[j][i] for j in range(len(motifs))])
is idiomatically written
''.join(m[i] for m in motifs)
so it concatenates the i'th entry of all motifs, in order. Similarly,
[(best.count(bseq)+1)/float(len(seq)) for base in 'ACGT']
builds a list of (best.count(bseq)+1)/float(len(seq)) values for of ACGT; since the base variable doesn't actually occur, it's a list containing the same value four times and can be simplified to
[(best.count(bseq)+1) / float(len(seq))] * 4
for i in range(len(motifs[0])):
seq = ''.join([motifs[j][i] for j in range(len(motifs))])
profile.append([(best.count(bseq)+1)/float(len(seq)) for base in 'ACGT'])
is equivalent to:
for i in range(len(motifs[0])):
seq = ''
for j in range(len(motifs)):
seq += motifs[j][i]
profile.append([(best.count(bseq)+1)/float(len(seq)) for base in 'ACGT'])
which can be improved in countless ways.
For example:
seqs = [ ''.join(motif) for motif in motifs ]
bc = best.count(bseq)+1
profilte.extend([ map(lambda x: bc / float(len(x)),
seq) for base in 'ACGT' ] for seq in seqs)
correctness of which, I cannot test due to lack of input/output conditions.
Closest I got without being able to test it
for i, _ in enumerate(motifs[0]):
seq = ""
for m in motifs:
seq += m[i]
tmp = []
for base in "ACGT":
tmp.append(best.count(bseq) + 1 / float(len(seq)))
profile.append(tmp)
Okay, basically what I want is to compress a file by reusing code and then at runtime replace missing code. What I've come up with is really ugly and slow, at least it works. The problem is that the file has no specific structure, for example 'aGVsbG8=\n', as you can see it's base64 encoding. My function is really slow because the length of the file is 1700+ and it checks for patterns 1 character at the time. Please help me with new better code or at least help me with optimizing what I got :). Anything that helps is welcome! BTW i have already tried compression libraries but they didn't compress as good as my ugly function.
def c_long(inp, cap=False, b=5):
import re,string
if cap is False: cap = len(inp)
es = re.escape; le=len; ref = re.findall; ran = range; fi = string.find
c = b;inpc = inp;pattern = inpc[:b]; l=[]
rep = string.replace; ins = list.insert
while True:
if c == le(inpc) and le(inpc) > b+1: c = b; inpc = inpc[1:]; pattern = inpc[:b]
elif le(inpc) <= b: break
if c == cap: c = b; inpc = inpc[1:]; pattern = inpc[:b]
p = ref(es(pattern),inp)
pattern += inpc[c]
if le(p) > 1 and le(pattern) >= b+1:
if l == []: l = [[pattern,le(p)+le(pattern)]]
elif le(ref(es(inpc[:c+2]),inp))+le(inpc[:c+2]) < le(p)+le(pattern):
x = [pattern,le(p)+le(inpc[:c+1])]
for i in ran(le(l)):
if x[1] >= l[i][1] and x[0][:-1] not in l[i][0]: ins(l,i,x); break
elif x[1] >= l[i][1] and x[0][:-1] in l[i][0]: l[i] = x; break
inpc = inpc[:fi(inpc,x[0])] + inpc[le(x[0]):]
pattern = inpc[:b]
c = b-1
c += 1
d = {}; c = 0
s = ran(le(l))
for x in l: inp = rep(inp,x[0],'{%d}' % s[c]); d[str(s[c])] = x[0]; c += 1
return [inp,d]
def decompress(inp,l): return apply(inp.format, [l[str(x)] for x in sorted([int(x) for x in l.keys()])])
The easiest way to compress base64-encoded data is to first convert it to binary data -- this will already save 25 percent of the storage space:
>>> s = "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=\n"
>>> t = s.decode("base64")
>>> len(s)
37
>>> len(t)
26
In most cases, you can compress the string even further using some compression algorithm, like t.encode("bz2") or t.encode("zlib").
A few remarks on your code: There are lots of factors that make the code hard to read: inconsistent spacing, overly long lines, meaningless variable names, unidiomatic code, etc. An example: Your decompress() function could be equivalently written as
def decompress(compressed_string, substitutions):
subst_list = [substitutions[k] for k in sorted(substitutions, key=int)]
return compressed_string.format(*subst_list)
Now it's already much more obvious what it does. You could go one step further: Why is substitutions a dictionary with the string keys "0", "1" etc.? Not only is it strange to use strings instead of integers -- you don't need the keys at all! A simple list will do, and decompress() will simplify to
def decompress(compressed_string, substitutions):
return compressed_string.format(*substitutions)
You might think all this is secondary, but if you make the rest of your code equally readable, you will find the bugs in your code yourself. (There are bugs -- it crashes for "abcdefgabcdefg" and many other strings.)
Typically one would pump the program through a compression algorithm optimized for text, then run that through exec, e.g.
code="""..."""
exec(somelib.decompress(code), globals=???, locals=???)
It may be the case that .pyc/.pyo files are compressed already, and one could check by creating one with x="""aaaaaaaa""", then increasing the length to x="""aaaaaaaaaaaaaaaaaaaaaaa...aaaa""" and seeing if the size changes appreciably.
I have mappings of "stems" and "endings" (may not be the correct words) that look like so:
all_endings = {
'birth': set(['place', 'day', 'mark']),
'snow': set(['plow', 'storm', 'flake', 'man']),
'shoe': set(['lace', 'string', 'maker']),
'lock': set(['down', 'up', 'smith']),
'crack': set(['down', 'up',]),
'arm': set(['chair']),
'high': set(['chair']),
'over': set(['charge']),
'under': set(['charge']),
}
But much longer, of course. I also made the corresponding dictionary the other way around:
all_stems = {
'chair': set(['high', 'arm']),
'charge': set(['over', 'under']),
'up': set(['lock', 'crack', 'vote']),
'down': set(['lock', 'crack', 'fall']),
'smith': set(['lock']),
'place': set(['birth']),
'day': set(['birth']),
'mark': set(['birth']),
'plow': set(['snow']),
'storm': set(['snow']),
'flake': set(['snow']),
'man': set(['snow']),
'lace': set(['shoe']),
'string': set(['shoe']),
'maker': set(['shoe']),
}
I've now tried to come up with an algorithm to find any match of two or more "stems" that match two or more "endings". Above, for example, it would match down and up with lock and crack, resulting in
lockdown
lockup
crackdown
crackup
But not including 'upvote', 'downfall' or 'locksmith' (and it's this that causes me the biggest problems). I get false positives like:
pancake
cupcake
cupboard
But I'm just going round in "loops". (Pun intended) and I don't seem to get anywhere. I'd appreciate any kick in the right direction.
Confused and useless code so far, which you probably should just ignore:
findings = defaultdict(set)
for stem, endings in all_endings.items():
# What stems have matching endings:
for ending in endings:
otherstems = all_stems[ending]
if not otherstems:
continue
for otherstem in otherstems:
# Find endings that also exist for other stems
otherendings = all_endings[otherstem].intersection(endings)
if otherendings:
# Some kind of match
findings[stem].add(otherstem)
# Go through this in order of what is the most stems that match:
MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
for this_stem in match:
other_stems = set() # Stems that have endings in common with this_stem
other_endings = set() # Endings this stem have in common with other stems
this_endings = all_endings[this_stem]
for this_ending in this_endings:
for other_stem in all_stems[this_ending] - set([this_stem]):
matching_endings = this_endings.intersection(all_endings[other_stem])
if matching_endings:
other_endings.add(this_ending)
other_stems.add(other_stem)
stem_matches = all_stems[other_endings.pop()]
for other in other_endings:
stem_matches = stem_matches.intersection(all_stems[other])
if len(stem_matches) >= MINMATCH:
for m in stem_matches:
for e in all_endings[m]:
print(m+e)
It's not particularly pretty, but this is quite straightforward if you break your dictionary down into two lists, and use explicit indices:
all_stems = {
'chair' : set(['high', 'arm']),
'charge': set(['over', 'under']),
'fall' : set(['down', 'water', 'night']),
'up' : set(['lock', 'crack', 'vote']),
'down' : set(['lock', 'crack', 'fall']),
}
endings = all_stems.keys()
stem_sets = all_stems.values()
i = 0
for target_stem_set in stem_sets:
i += 1
j = 0
remaining_stems = stem_sets[i:]
for remaining_stem_set in remaining_stems:
j += 1
union = target_stem_set & remaining_stem_set
if len(union) > 1:
print "%d matches found" % len(union)
for stem in union:
print "%s%s" % (stem, endings[i-1])
print "%s%s" % (stem, endings[j+i-1])
Output:
$ python stems_and_endings.py
2 matches found
lockdown
lockup
crackdown
crackup
Basically all we're doing is iterating through each set in turn, and comparing it with every remaining set to see if there are more than two matches. We never have to try sets that fall earlier than the current set, because they've already been compared in a prior iteration. The rest (indexing, etc.) is just book-keeping.
I think that the way I avoid those false positives is by removing candidates with no words in the intersection of stems - If this make sense :(
Please have a look and please let me know if I am missing something.
#using all_stems and all_endings from the question
#this function is declared at the end of this answer
two_or_more_stem_combinations = get_stem_combinations(all_stems)
print "two_or_more_stem_combinations", two_or_more_stem_combinations
#this print shows ... [set(['lock', 'crack'])]
for request in two_or_more_stem_combinations:
#we filter the initial index to only look for sets or words in the request
candidates = filter(lambda x: x[0] in request, all_endings.items())
#intersection of the words for the request
words = candidates[0][1]
for c in candidates[1:]:
words=words.intersection(c[1])
#it's handy to have it in a dict
candidates = dict(candidates)
#we need to remove those that do not contain
#any words after the intersection of stems of all the candidates
candidates_to_remove = set()
for c in candidates.items():
if len(c[1].intersection(words)) == 0:
candidates_to_remove.add(c[0])
for key in candidates_to_remove:
del candidates[key]
#now we know what to combine
for c in candidates.keys():
print "combine", c , "with", words
Output :
combine lock with set(['down', 'up'])
combine crack with set(['down', 'up'])
As you can see this solution doesn't contain those false positives.
Edit: complexity
And the complexity of this solution doesn't get worst than O(3n) in the worst scenario - without taking into account accessing dictionaries. And
for most executions the first filter narrows down quite a lot the solution space.
Edit: getting the stems
This function basically explores recursively the dictionary all_stems and finds the combinations of two or more endings for which two or more stems coincide.
def get_stems_recursive(stems,partial,result,at_least=2):
if len(partial) >= at_least:
stem_intersect=all_stems[partial[0]]
for x in partial[1:]:
stem_intersect = stem_intersect.intersection(all_stems[x])
if len(stem_intersect) < 2:
return
result.append(stem_intersect)
for i in range(len(stems)):
remaining = stems[i+1:]
get_stems_recursive(remaining,partial + [stems[i][0]],result)
def get_stem_combinations(all_stems,at_least=2):
result = []
get_stems_recursive(all_stems.items(),list(),result)
return result
two_or_more_stem_combinations = get_stem_combinations(all_stems)
== Edited answer: ==
Well, here's another iteration for your consideration with the mistakes I made the first time addressed. Actually the result is code that is even shorter and simpler. The doc for combinations says that "if the input elements are unique, there will be no repeat values in each combination", so it should only be forming and testing the minimum number of intersections. It also appears that determining endings_by_stems isn't necessary.
from itertools import combinations
MINMATCH = 2
print 'all words with at least', MINMATCH, 'endings in common:'
for (word0,word1) in combinations(stems_by_endings, 2):
ending_words0 = stems_by_endings[word0]
ending_words1 = stems_by_endings[word1]
common_endings = ending_words0 & ending_words1
if len(common_endings) >= MINMATCH:
for stem in common_endings:
print ' ', stem+word0
print ' ', stem+word1
# all words with at least 2 endings in common:
# lockdown
# lockup
# falldown
# fallup
# crackdown
# crackup
== Previous answer ==
I haven't attempted much optimizing, but here's a somewhat brute-force -- but short -- approach that first calculates 'ending_sets' for each stem word, and then finds all the stem words that have common ending_sets with at least the specified minimum number of common endings.
In the final phase it prints out all the possible combinations of these stem + ending words it has detected that have meet the criteria. I tried to make all variable names as descriptive as possible to make it easy to follow. ;-) I've also left out the definitions of all_endings' and 'all+stems.
from collections import defaultdict
from itertools import combinations
ending_sets = defaultdict(set)
for stem in all_stems:
# create a set of all endings that have this as stem
for ending in all_endings:
if stem in all_endings[ending]:
ending_sets[stem].add(ending)
MINMATCH = 2
print 'all words with at least', MINMATCH, 'endings in common:'
for (word0,word1) in combinations(ending_sets, 2):
ending_words0 = ending_sets[word0]
ending_words1 = ending_sets[word1]
if len(ending_words0) >= MINMATCH and ending_words0 == ending_words1:
for stem in ending_words0:
print ' ', stem+word0
print ' ', stem+word1
# output
# all words with at least 2 endings in common:
# lockup
# lockdown
# crackup
# crackdown
If you represent your stemming relationships in a square binary arrays (where 1 means "x can follow y", for instance, and where other elements are set to 0), what you are trying to do is equivalent to looking for "broken rectangles" filled with ones:
... lock **0 crack **1 ...
... ...
down ... 1 0 1 1
up ... 1 1 1 1
... ...
Here, lock, crack, and **1 (example word) can be matched with down and up (but not word **0). The stemming relationships draw a 2x3 rectangle filled with ones.
Hope this helps!