Understanding another's text-mining function that removes similar strings

I’m trying to replicate the methodology from this article, 538 Post about Most Repetitive Phrases, in which the author mined US presidential debate transcripts to determine the most repetitive phrases for each candidate.
I'm trying to implement this methodology with another dataset in R with the tm package.
Most of the code (GitHub repository) concerns mining the transcripts and assembling counts of each ngram, but I get lost at the prune_substrings() function code below:
def prune_substrings(tfidf_dicts, prune_thru=1000):
    pruned = tfidf_dicts
    for candidate in range(len(candidates)):
        # growing list of n-grams in list form
        so_far = []
        ngrams_sorted = sorted(tfidf_dicts[candidate].items(), key=operator.itemgetter(1), reverse=True)[:prune_thru]
        for ngram in ngrams_sorted:
            # contained in a previous aka 'better' phrase
            for better_ngram in so_far:
                if overlap(list(better_ngram), list(ngram[0])):
                    #print "PRUNING!! "
                    #print list(better_ngram)
                    #print list(ngram[0])
                    pruned[candidate][ngram[0]] = 0
            # not contained, so add to so_far to prevent future subphrases
            else:
                so_far += [list(ngram[0])]
    return pruned
The input of the function, tfidf_dicts, is an array of dictionaries (one for each candidate) with ngrams as keys and tf-idf scores as values. For example, Trump's tf-idf dict begins like this:
trump.tfidf.dict = {'we don't win': 83.2, 'you have to': 72.8, ... }
so the structure of the input is like this:
tfidf_dicts = {trump.tfidf.dict, rubio.tfidf.dict, etc }
My understanding is that prune_substrings does the following things, but I'm stuck on the else clause, which is a Python construct I don't understand yet.
A. create list: pruned as tfidf_dicts; a list of tfidf dicts for each candidate
B. loop through each candidate:
    so_far = start an empty list of the ngrams gone through so far
    ngrams_sorted = the candidate's tf-idf dict sorted from biggest to smallest score (top prune_thru entries)
    loop through each ngram in ngrams_sorted
        loop through each better_ngram in so_far
            IF overlap b/w (below) == TRUE:
                better_ngram (from so_far) and
                ngram (from ngrams_sorted)
            THEN zero out tf-idf for ngram
        ELSE if (WHAT?!?)
            add ngram to list, so_far
C. return pruned, i.e. list of unique ngrams sorted in order
Any help at all is much appreciated!

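Note the indentation in your code: the else is lined up with the second (inner) for, not with the if. This is a for-else construct, not an if-else.
A for loop's else block runs whenever the loop finishes without hitting a break, which includes the case where there is nothing to iterate over. Here that means it runs the first time through, when so_far is empty, and again each time the inner loop runs out of items to iterate through. Since the inner loop in prune_substrings never breaks, the else block runs for every ngram.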
I am not sure that this is the most efficient way to achieve these comparisons, but conceptually you can get a sense of the flow with this snippet:
s = []
for j in "ABCD":
    for i in s:
        print(i, end=" ")
    else:
        # runs every time the inner loop finishes, even when s is empty
        print("\nelse")
        s.append(j)
Output:
else
A
else
A B
else
A B C
else
I would think that in R there is a much better way to do this than nested loops....
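To make the for-else behaviour concrete, here is a minimal Python sketch of the variant with a break, in which the else branch only runs when no earlier phrase overlaps. It is not the article's code: keep_non_overlapping and shares_word are hypothetical names, and sharing a word is only a stand-in assumption for the real overlap() function.

```python
def shares_word(a, b):
    """Stand-in overlap test (an assumption): True if the phrases share a word."""
    return bool(set(a.split()) & set(b.split()))


def keep_non_overlapping(phrases, overlaps):
    """Keep phrases (best first) that overlap no previously kept phrase."""
    kept = []
    for phrase in phrases:
        for better in kept:
            if overlaps(better, phrase):
                break  # overlap found: leave the loop, so the else is skipped
        else:
            # the inner loop finished without break: nothing overlapped
            kept.append(phrase)
    return kept


phrases = ["we don't win", "don't win anymore", "you have to", "have to be"]
print(keep_non_overlapping(phrases, shares_word))
# ["we don't win", 'you have to']
```

Unlike prune_substrings, this version does not add pruned phrases to the list of earlier phrases, so it is a different pruning rule and not a drop-in replacement.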

Four months later, but here's my solution. I'm sure there is a more efficient one, but it worked for my purposes. The Pythonic for-else doesn't translate directly to R, so the steps are different.
Take the top n ngrams.
Create a list, t, where each element is a logical vector of length n that says whether the ngram in question overlaps each of the other ngrams (but force elements 1:x to FALSE, i.e. no overlap, for the ngram itself and the higher-scoring ngrams)
cbind every element of t together into a matrix, t2
Return only the ngrams whose row sum in t2 is zero
Voilà!
PrunedList Function
#' GetPrunedList
#'
#' takes a word freq df with columns Words and LenNorm, returns df of nonoverlapping strings
GetPrunedList <- function(wordfreqdf, prune_thru = 100) {
  #take only first n items in list
  tmp <- head(wordfreqdf, n = prune_thru) %>%
    select(ngrams = Words, tfidfXlength = LenNorm)
  #for each ngram in list:
  t <- (lapply(1:nrow(tmp), function(x) {
    #find overlap between ngram and all items in list (overlap = TRUE)
    idx <- overlap(tmp[x, "ngrams"], tmp$ngrams)
    #set overlap as false for itself and higher-scoring ngrams
    idx[1:x] <- FALSE
    idx
  }))
  #bind each ngram's overlap vector together to make a matrix
  t2 <- do.call(cbind, t)
  #find rows(i.e. ngrams) that do not overlap with those below
  idx <- rowSums(t2) == 0
  pruned <- tmp[idx,]
  rownames(pruned) <- NULL
  pruned
}
Overlap function
#' overlap
#' OBJ: takes two ngrams (as strings) and to see if they overlap
#' INPUT: a,b ngrams as strings
#' OUTPUT: TRUE if overlap
overlap <- function(a, b) {
  max_overlap <- min(3, CountWords(a), CountWords(b))
  a.beg <- word(a, start = 1L, end = max_overlap)
  a.end <- word(a, start = -max_overlap, end = -1L)
  b.beg <- word(b, start = 1L, end = max_overlap)
  b.end <- word(b, start = -max_overlap, end = -1L)
  # b contains a's beginning
  w <- str_detect(b, coll(a.beg, TRUE))
  # b contains a's end
  x <- str_detect(b, coll(a.end, TRUE))
  # a contains b's beginning
  y <- str_detect(a, coll(b.beg, TRUE))
  # a contains b's end
  z <- str_detect(a, coll(b.end, TRUE))
  #return TRUE if any of above are true
  (w | x | y | z)
}

Related

Python: add ranges to a list of range while iterating over it

I've run into a problem and hope someone can give me a tip to overcome it.
I have a 2D Python list (83 rows and 3 columns). The first 2 columns are the start and end positions of an interval. The 3rd column is a numeric index (e.g. 9.68). The list is reverse-sorted by the 3rd column.
I want to get all the non-overlapping intervals with the highest index.
Here is an example of the sorted list:
504 789 9.68
503 784 9.14
505 791 8.78
499 798 8.73
1024 1257 7.52
1027 1305 7.33
507 847 5.86
Here is what I tried:
# Define a function that tests if 2 intervals overlap
def overlap(start1, end1, start2, end2):
    return not (end1 < start2 or end2 < start1)
best_list = [] # Create a list that will store the best intervals
best_list.append([sort[0][0],sort[0][1]]) # Append the first interval of the sorted list
# Loop through the sorted list
for line in sort:
    local_start, local_end = line.rsplit("\s",1)[0].split()
    for i in range(len(best_list)):
        best_start = best_list[i][0]
        best_end = best_list[i][1]
        test = overlap(int(best_start), int(best_end), int(local_start), int(local_end))
        if test is False:
            best_list.append([local_start, local_end])
And I get:
best_list = [(504, 789),(1024, 1257),(1027, 1305)]
But I want:
best_list = [(504, 789),(1024, 1257)]
Thanks!

Well, I have a question about your code. Since sort contains strings, does the line append([sort[0][0],sort[0][1]]) do what you expect?
Anyway, on to the main part: your problem is that when best_list contains several elements, it is enough for the new interval to pass the overlap test against just one of them for it to be added (not what you want). E.g. once both (504, 789) and (1024, 1257) are in the list, (1027, 1305) will be inserted because it passes the test when compared to (504, 789).
So, I made a few changes and now it seems to work as expected:
best_list = [] # Create a list that will store the best intervals
best_list.append(sort[0].rsplit(" ", 1)[0].split()) # Append the first interval of the sorted list
# Loop through the sorted list
for line in sort:
    # Split on a literal space: str.rsplit treats "\s" as plain text, not as whitespace
    local_start, local_end = line.rsplit(" ", 1)[0].split()
    flag = False # <- flag to check the overall overlapping
    for i in range(len(best_list)):
        best_start = best_list[i][0]
        best_end = best_list[i][1]
        test = overlap(int(best_start), int(best_end), int(local_start), int(local_end))
        print(test)
        if test:
            flag = False
            break
        flag = True
    if flag:
        best_list.append([local_start, local_end])
The main idea is to check the candidate against every element already in best_list and add it only if it passes all the overlap tests (the last line of my code), not before.
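The same "add only after every comparison has passed" rule can also be written with any() instead of a flag. The following is a minimal sketch, assuming the rows have already been parsed into (start, end, index) tuples and sorted by index; best_intervals is a hypothetical name, and overlap() is the function from the question.

```python
def overlap(start1, end1, start2, end2):
    """True if the closed intervals [start1, end1] and [start2, end2] overlap."""
    return not (end1 < start2 or end2 < start1)


def best_intervals(rows):
    """Greedily keep intervals that overlap none of the intervals kept so far.

    rows is assumed to be sorted best-first as (start, end, index) tuples.
    """
    best = []
    for start, end, _index in rows:
        # decide only after comparing against every kept interval
        if not any(overlap(s, e, start, end) for s, e in best):
            best.append((start, end))
    return best


rows = [
    (504, 789, 9.68), (503, 784, 9.14), (505, 791, 8.78), (499, 798, 8.73),
    (1024, 1257, 7.52), (1027, 1305, 7.33), (507, 847, 5.86),
]
print(best_intervals(rows))
# [(504, 789), (1024, 1257)]
```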

Suppose you parse your csv and already have a list with [(start, stop, index), ....] as [(int, int, float), ...] then you can sort it with the following:
from operator import itemgetter
data = sorted(data, key=itemgetter(2), reverse=True)
This means that you sort by third position and return the result in reverse order from max to min.
def nonoverlap(data):
    result = [data[0]]
    for cand in data[1:]:
        start, stop, _ = cand
        current_span = range(start, stop+1)
        for item in result:
            i, j, _ = item
            span = range(i, j+1)
            if (start in span) or (stop in span):
                break
            elif (i in current_span) or (j in current_span):
                break
        else:
            result.append(cand)
    return result
With the above function you will obtain the desired result. For the provided data you will obtain [(504, 789, 9.68), (1024, 1257, 7.52)]. I use here the fact that a membership test such as 1 in range(0, 10) returns True. While this is a naive implementation, you can use it as a starting point. If you want to return only the starts and stops, replace the return line with return [i[:2] for i in result].
Note: I also want to add that your code has a logical mistake. You make a decision after each comparison, but you must make it only after comparing against all the elements already present in your best_list. That is why (1027, 1305) passes your test against (504, 789) and is added, when it should not be. I hope this note helps.

Word Ladder without replacement in python

I have a question in which I need to implement the word ladder problem with a different logic.
In each step, the player must either add one letter to the word from the previous step or take one letter away, and then rearrange the letters to make a new word.
croissant(-C) -> arsonist(-S) -> aroints(+E)->notaries(+B)->baritones(-S)->baritone
The new word must be a valid word from wordList.txt, which is a dictionary of words.
Dictionary
My code works like this:
First, I calculate which characters have to be removed ("remove_list") and which have to be added ("add_list"), and store them in lists.
Then I read the file and store the words in a dictionary keyed by each word's sorted letters.
Then I start removing letters from and adding letters to the start word, matching each result against the dictionary.
The challenge now is that some words produced by a deletion or addition don't match anything in the dictionary, so the search misses the goal.
In that case, it should backtrack to the previous step and add a letter instead of removing one.
I am looking for some sort of recursive function that could help with this, or a completely new logic that achieves the output.
A sample of my code:
start = 'croissant'
goal = 'baritone'
list_start = map(list,start)
list_goal = map(list, goal)
remove_list = [x for x in list_start if x not in list_goal]
add_list = [x for x in list_goal if x not in list_start]
file = open('wordList.txt','r')
dict_words = {}
for word in file:
    strip_word = word.rstrip()
    dict_words[''.join(sorted(strip_word))]=strip_word
file.close()
final_list = []
flag_remove = 0
for i in remove_list:
    sorted_removed_list = sorted(start.replace(''.join(map(str, i)),"",1))
    sorted_removed_string = ''.join(map(str, sorted_removed_list))
    if sorted_removed_string in dict_words.keys():
        print dict_words[sorted_removed_string]
        final_list.append(sorted_removed_string)
        flag_remove = 1
        start = sorted_removed_string
print final_list
flag_add = 0
for i in add_list:
    first_character = ''.join(map(str,i))
    sorted_joined_list = sorted(''.join([first_character, final_list[-1]]))
    sorted_joined_string = ''.join(map(str, sorted_joined_list))
    if sorted_joined_string in dict_words.keys():
        print dict_words[sorted_joined_string]
        final_list.append(sorted_joined_string)
        flag_add = 1
        sorted_removed_string = sorted_joined_string

Recursion-based backtracking isn't a good idea for a search problem of this sort. It goes blindly down the search tree without exploiting the fact that words are almost never 10-12 steps away from each other, and it risks a stack overflow (in Python, exceeding the recursion limit).
The solution here uses breadth-first search. It uses mate(s) as a helper which, given a word s, finds all the words we can travel to next. mate in turn uses a global dictionary, wdict, built at the beginning of the program, which maps a word's sorted letters to all of its anagrams (i.e. rearrangements of its letters).
from queue import Queue
words = set(''.join(s[:-1]) for s in open("wordsEn.txt"))
wdict = {}
for w in words:
    s = ''.join(sorted(w))
    if s in wdict: wdict[s].append(w)
    else: wdict[s] = [w]
def mate(s):
    global wdict
    ans = [''.join(s[:c]+s[c+1:]) for c in range(len(s))]
    for c in range(97,123): ans.append(s + chr(c))
    for m in ans: yield from wdict.get(''.join(sorted(m)),[])
def bfs(start,goal,depth=0):
    already = set([start])
    prev = {}
    q = Queue()
    q.put(start)
    while not q.empty():
        cur = q.get()
        if cur==goal:
            ans = []
            while cur: ans.append(cur);cur = prev.get(cur)
            return ans[::-1] #reverse the array
        for m in mate(cur):
            if m not in already:
                already.add(m)
                q.put(m)
                prev[m] = cur
print(bfs('croissant','baritone'))
which outputs: ['croissant', 'arsonist', 'rations', 'senorita', 'baritones', 'baritone']
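For readers without the wordsEn.txt file, the same breadth-first idea can be tried on a small in-memory word list. This is only a sketch under assumptions: the eight-word WORDS list is made up for illustration, the function names are hypothetical, and collections.deque stands in for queue.Queue.

```python
from collections import defaultdict, deque
from string import ascii_lowercase

# Tiny illustrative word list (an assumption; the answer reads wordsEn.txt).
WORDS = ["croissant", "arsonist", "rations", "aroints",
         "notaries", "senorita", "baritones", "baritone"]


def build_anagram_index(words):
    """Map each word's sorted letters to every word with those letters."""
    index = defaultdict(list)
    for word in words:
        index["".join(sorted(word))].append(word)
    return index


def neighbours(word, index):
    """Yield words reachable by removing or adding one letter, then rearranging."""
    candidates = [word[:i] + word[i + 1:] for i in range(len(word))]
    candidates += [word + letter for letter in ascii_lowercase]
    for cand in candidates:
        yield from index.get("".join(sorted(cand)), [])


def shortest_ladder(start, goal, index):
    """Breadth-first search; returns a shortest ladder or None."""
    prev = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for nxt in neighbours(cur, index):
            if nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    return None


index = build_anagram_index(WORDS)
print(shortest_ladder("croissant", "baritone", index))
# prints a six-word ladder from 'croissant' to 'baritone'
```

Which of the equally short ladders is returned depends on the order of the anagrams in the index, so the intermediate words may differ from the output shown above.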

Python word game. Last letter of first word == first letter of second word. Find longest possible sequence of words

I'm trying to write a program that mimics a word game where, from a given set of words, it will find the longest possible sequence of words. No word can be used twice.
I can do the matching letters and words up, and storing them into lists, but I'm having trouble getting my head around how to handle the potentially exponential number of possibilities of words in lists. If word 1 matches word 2 and then I go down that route, how do I then back up to see if words 3 or 4 match up with word one and then start their own routes, all stemming from the first word?
I was thinking some way of calling the function inside itself maybe?
I know it's nowhere near doing what I need it to do, but it's a start. Thanks in advance for any help!
g = "audino bagon baltoy banette bidoof braviary bronzor carracosta charmeleon cresselia croagunk darmanitan deino emboar emolga exeggcute gabite girafarig gulpin haxorus"
def pokemon():
    count = 1
    names = g.split()
    first = names[count]
    master = []
    for i in names:
        print (i, first, i[0], first[-1])
        if i[0] == first[-1] and i not in master:
            master.append(i)
            count += 1
            first = i
            print ("success", master)
    if len(master) == 0:
        return "Pokemon", first, "does not work"
    count += 1
    first = names[count]
pokemon()

Your idea of calling a function inside of itself is a good one. We can solve this with recursion:
def get_neighbors(word, choices):
    return set(x for x in choices if x[0] == word[-1])
def longest_path_from(word, choices):
    choices = choices - set([word])
    neighbors = get_neighbors(word, choices)
    if neighbors:
        paths = (longest_path_from(w, choices) for w in neighbors)
        max_path = max(paths, key=len)
    else:
        max_path = []
    return [word] + max_path
def longest_path(choices):
    return max((longest_path_from(w, choices) for w in choices), key=len)
Now we just define our word list:
words = ("audino bagon baltoy banette bidoof braviary bronzor carracosta "
"charmeleon cresselia croagunk darmanitan deino emboar emolga "
"exeggcute gabite girafarig gulpin haxorus")
words = frozenset(words.split())
Call longest_path with a set of words:
>>> longest_path(words)
['girafarig', 'gabite', 'exeggcute', 'emolga', 'audino']
A couple of things to know: as you point out, this has exponential complexity, so beware! Also, know that python has a recursion limit!

Using some black magic and graph theory I found a partial solution that might be good (not thoroughly tested).
The idea is to map your problem onto a graph problem rather than a simple iterative one (although that might work too!). I defined the nodes of the graph to be the first letters and last letters of your words. Edges can only be created between a node of type last and a node of type first, and the last node of word number X cannot be linked to the first node of the same word X (a word cannot be followed by itself). With that, your problem is the same as the longest path problem, which is NP-hard in the general case :)
By taking some information here: stackoverflow-17985202 I managed to write this:
g = "audino bagon baltoy banette bidoof braviary bronzor carracosta charmeleon cresselia croagunk darmanitan deino emboar emolga exeggcute gabite girafarig gulpin haxorus"
words = g.split()
begin = [w[0] for w in words] # Nodes first
end = [w[-1] for w in words] # Nodes last
links = []
for i, l in enumerate(end): # Construct edges
    ok = True
    offset = 0
    while ok:
        try:
            bl = begin.index(l, offset)
            if i != bl: # Cannot map to self
                links.append((i, bl))
            offset = bl + 1 # next possible edge
        except ValueError: # no more possible edge for this last node, Next!
            ok = False
# Great function shamelessly taken from stackoverflow (link provided above)
import networkx as nx
def longest_path(G):
    dist = {} # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist,node for all incoming edges
        pairs = [(dist[v][0]+1,v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node,(length,_) = max(dist.items(), key=lambda x:x[1])
    path = []
    while length > 0:
        path.append(node)
        length,node = dist[node]
    return list(reversed(path))
# Construct graph
G = nx.DiGraph()
G.add_edges_from(links)
# TADAAAA!
print(longest_path(G))
Although it looks nice, there is a big drawback. Your example works because there is no cycle in the graph built from the input words; this solution fails on cyclic graphs.
A way around that is to detect cycles and break them. Detection can be done this way:
if nx.recursive_simple_cycles(G):
    print("CYCLES!!! /o\\")
Breaking a cycle can be done by dropping an edge in the cycle, but if you drop a random edge you will only find the optimal solution by chance (imagine a cycle with a tail: you should cut the cycle at the node that has 3 edges). So I suggest brute-forcing this part: try all possible cycle breaks, compute the longest path for each, and take the longest of those. If you have multiple cycles the number of possibilities becomes a bit more explosive... but hey, it's NP-hard, at least the way I see it, and I didn't plan to solve that now :)
Hope it helps

Here's a solution that doesn't require recursion. It uses the itertools permutations function to look at all possible orderings of the words and finds the one that chains into the longest result (measured here by the length of the concatenated string, not the number of words). To save time, as soon as an ordering hits a word that doesn't link, it stops checking that ordering and moves on.
>>> import itertools
>>> g = 'girafarig eudino exeggcute omolga gabite'
>>> p = itertools.permutations(g.split())
>>> longestword = ""
>>> for words in p:
...     thistry = words[0]
...     # Concatenates words until the next word doesn't link with this one.
...     for i in range(len(words) - 1):
...         if words[i][-1] != words[i+1][0]:
...             break
...         thistry += words[i+1]
...     if len(thistry) > len(longestword):
...         longestword = thistry
...         print(longestword)
...
>>> print("Final answer is {}".format(longestword))
girafarig
girafariggabiteeudino
girafariggabiteeudinoomolga
girafariggabiteexeggcuteeudinoomolga
Final answer is girafariggabiteexeggcuteeudinoomolga

First, let's see what the problem looks like:
from collections import defaultdict
import pydot
words = (
    "audino bagon baltoy banette bidoof braviary bronzor carracosta "
    "charmeleon cresselia croagunk darmanitan deino emboar emolga "
    "exeggcute gabite girafarig gulpin haxorus"
).split()
def main():
    # get first -> last letter transitions
    nodes = set()
    arcs = defaultdict(lambda: defaultdict(list))
    for word in words:
        first = word[0]
        last = word[-1]
        nodes.add(first)
        nodes.add(last)
        arcs[first][last].append(word)
    # create a graph
    graph = pydot.Dot("Word_combinations", graph_type="digraph")
    # use letters as nodes
    for node in sorted(nodes):
        n = pydot.Node(node, shape="circle")
        graph.add_node(n)
    # use first-last as directed edges
    for first, sub in arcs.items():
        for last, wordlist in sub.items():
            count = len(wordlist)
            label = str(count) if count > 1 else ""
            e = pydot.Edge(first, last, label=label)
            graph.add_edge(e)
    # save result as PNG so the file format matches the .png extension
    graph.write_png("g:/temp/wordgraph.png", prog="dot")
if __name__=="__main__":
    main()
results in
[figure: directed graph of first-letter to last-letter transitions]
which makes the solution fairly obvious (path shown in red), but only because the graph is acyclic (with the exception of two trivial self-loops).
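The acyclicity claim can be checked without drawing anything. The sketch below uses only the standard library and is an illustration, not part of the answer: letter_arcs and has_cycle are hypothetical names, and self-loops (words that start and end with the same letter) are collected separately and left out of the cycle check.

```python
from collections import defaultdict

WORDS = (
    "audino bagon baltoy banette bidoof braviary bronzor carracosta "
    "charmeleon cresselia croagunk darmanitan deino emboar emolga "
    "exeggcute gabite girafarig gulpin haxorus"
).split()


def letter_arcs(words):
    """Return (first letter -> set of last letters, set of self-loop letters)."""
    arcs = defaultdict(set)
    self_loops = set()
    for word in words:
        first, last = word[0], word[-1]
        if first == last:
            self_loops.add(first)
        else:
            arcs[first].add(last)
    return arcs, self_loops


def has_cycle(arcs):
    """Depth-first search with three colours; a grey successor means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every letter starts WHITE (unvisited)

    def visit(node):
        colour[node] = GREY
        for nxt in arcs.get(node, ()):
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in list(arcs))


arcs, self_loops = letter_arcs(WORDS)
print(sorted(self_loops))  # ['e', 'g']
print(has_cycle(arcs))     # False
```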

translate my sequence?

I have to write a script to translate this sequence:
dict = {"TTT":"F|Phe","TTC":"F|Phe","TTA":"L|Leu","TTG":"L|Leu","TCT":"S|Ser","TCC":"S|Ser",
"TCA":"S|Ser","TCG":"S|Ser", "TAT":"Y|Tyr","TAC":"Y|Tyr","TAA":"*|Stp","TAG":"*|Stp",
"TGT":"C|Cys","TGC":"C|Cys","TGA":"*|Stp","TGG":"W|Trp", "CTT":"L|Leu","CTC":"L|Leu",
"CTA":"L|Leu","CTG":"L|Leu","CCT":"P|Pro","CCC":"P|Pro","CCA":"P|Pro","CCG":"P|Pro",
"CAT":"H|His","CAC":"H|His","CAA":"Q|Gln","CAG":"Q|Gln","CGT":"R|Arg","CGC":"R|Arg",
"CGA":"R|Arg","CGG":"R|Arg", "ATT":"I|Ile","ATC":"I|Ile","ATA":"I|Ile","ATG":"M|Met",
"ACT":"T|Thr","ACC":"T|Thr","ACA":"T|Thr","ACG":"T|Thr", "AAT":"N|Asn","AAC":"N|Asn",
"AAA":"K|Lys","AAG":"K|Lys","AGT":"S|Ser","AGC":"S|Ser","AGA":"R|Arg","AGG":"R|Arg",
"GTT":"V|Val","GTC":"V|Val","GTA":"V|Val","GTG":"V|Val","GCT":"A|Ala","GCC":"A|Ala",
"GCA":"A|Ala","GCG":"A|Ala", "GAT":"D|Asp","GAC":"D|Asp","GAA":"E|Glu",
"GAG":"E|Glu","GGT":"G|Gly","GGC":"G|Gly","GGA":"G|Gly","GGG":"G|Gly"}
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
a=""
for y in range( 0, len ( seq)):
    c=(seq[y:y+3])
    #print(c)
    for k, v in dict.items():
        if seq[y:y+3] == k:
            alle_amino = v[::3] # all amino acids in a row: a1.1, a2.1, a3.1, a1.2, and so on
            print (v)
With this script I get the amino acids from the 3 frames underneath each other. How can I sort them so that all the amino acids from frame 1 are next to each other, then all those from frame 2, and the same for frame 3?
For example, my results must be:
+3 SerIleLeuAlaStpProLysTrpGluProProTyrValAlaStpProIleTyrIleTyrIle
+2 PheAsnThrSerMetThrLysValGlyThrProLeuArgSerMetThrHisIleTyrIleTyr
+1 PheGlnTyrStpHisAspGlnSerGlyAsnProLeuThrStpHisAspProTyrIleTyrIle
TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA
I use Python 3.
I have one more question: can I get these results by making some changes to my own script?

You can use the following (note that this would be much easier with Biopython's translate method):
dicti = {your dictionary here}
def translate(seq):
    x = 0
    aaseq = []
    while True:
        try:
            aaseq.append(dicti[seq[x:x+3]])
            x += 3
        except (IndexError, KeyError):
            break
    return aaseq
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
for frame in range(3):
    print('+%i' %(frame+1), ''.join(item.split('|')[1] for item in translate(seq[frame:])))
Note that I renamed your dictionary to dicti (so as not to overwrite the built-in dict).
Some comments to help you understand:
translate takes your sequence and returns a list in which each item is the amino acid translation of the corresponding triplet. Like:
aaseq = ["L|Leu","L|Leu","P|Pro", ....]
You could process this data further inside translate (keep only the one-letter or the three-letter code), or return it as it is to be processed later, as I have done.
translate is called in
''.join(item.split('|')[1] for item in translate(seq[frame:]))
for each frame. For frame values 0, 1 and 2 it passes seq[frame:] to translate. That is, you send the sequences corresponding to the three different reading frames and process them one after another. Then, in
''.join(item.split('|')[1]
I split the one-letter and three-letter codes for each amino acid and take the one at index 1 (the second). They are then joined into a single string.

Not too pretty, but does what you want
dct = {"TTT":"F|Phe","TTC":"F|Phe","TTA":"L|Leu","TTG":"L|Leu","TCT":"S|Ser","TCC":"S|Ser",
"TCA":"S|Ser","TCG":"S|Ser", "TAT":"Y|Tyr","TAC":"Y|Tyr","TAA":"*|Stp","TAG":"*|Stp",
"TGT":"C|Cys","TGC":"C|Cys","TGA":"*|Stp","TGG":"W|Trp", "CTT":"L|Leu","CTC":"L|Leu",
"CTA":"L|Leu","CTG":"L|Leu","CCT":"P|Pro","CCC":"P|Pro","CCA":"P|Pro","CCG":"P|Pro",
"CAT":"H|His","CAC":"H|His","CAA":"Q|Gln","CAG":"Q|Gln","CGT":"R|Arg","CGC":"R|Arg",
"CGA":"R|Arg","CGG":"R|Arg", "ATT":"I|Ile","ATC":"I|Ile","ATA":"I|Ile","ATG":"M|Met",
"ACT":"T|Thr","ACC":"T|Thr","ACA":"T|Thr","ACG":"T|Thr", "AAT":"N|Asn","AAC":"N|Asn",
"AAA":"K|Lys","AAG":"K|Lys","AGT":"S|Ser","AGC":"S|Ser","AGA":"R|Arg","AGG":"R|Arg",
"GTT":"V|Val","GTC":"V|Val","GTA":"V|Val","GTG":"V|Val","GCT":"A|Ala","GCC":"A|Ala",
"GCA":"A|Ala","GCG":"A|Ala", "GAT":"D|Asp","GAC":"D|Asp","GAA":"E|Glu",
"GAG":"E|Glu","GGT":"G|Gly","GGC":"G|Gly","GGA":"G|Gly","GGG":"G|Gly"}
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
def get_amino_list(s):
    for y in range(3):
        yield [s[x:x+3] for x in range(y, len(s) - 2, 3)]
for n, amn in enumerate(get_amino_list(seq), 1):
    print ("+%d " % n + "".join(dct[x][2:] for x in amn))
print(seq)

Here's my solution. I've called your "dict" variable "aminos". The function method3 returns a list of the values to the right of the "|". To merge them into a single string, just join them on "".
From looking at your code, I believe that your aminos dict contains all possible three-letter combinations. Therefore, I've removed the checks that verify this. It should run a lot faster as a result.
def codon_groups(seq, group_len=3):
    """Yields consecutive, non-overlapping groups of `group_len` items from `seq`
    """
    # step by group_len so each codon is read once; the +1 keeps the last full group
    for i in range(0, len(seq) - group_len + 1, group_len):
        yield seq[i:i+group_len]
def method3(seq, aminos):
    return [aminos[k][2:] for k in codon_groups(seq, 3)]
for i in range(3):
    print("%d: %s" % (i, "".join(method3(seq[i:], aminos))))
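As a compact, self-contained illustration of reading the three frames, the sketch below builds the standard genetic code from its usual 64-letter string in T, C, A, G order and prints one-letter codes instead of the three-letter codes used in the question. The names CODON_TABLE, translate_frame and three_frames are hypothetical, and the short test sequence is just the first 12 bases of the question's sequence.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(seq, frame):
    """Translate seq starting at offset frame (0, 1 or 2), one codon at a time."""
    # stop at len(seq) - 2 so that only complete codons are read
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def three_frames(seq):
    """Return the translations of the three forward reading frames."""
    return [translate_frame(seq, frame) for frame in range(3)]


for n, protein in enumerate(three_frames("TTTCAATACTAG"), 1):
    print("+%d %s" % (n, protein))
# +1 FQY*
# +2 FNT
# +3 SIL
```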

Finding combinations of stems and endings

I have mappings of "stems" and "endings" (may not be the correct words) that look like so:
all_endings = {
'birth': set(['place', 'day', 'mark']),
'snow': set(['plow', 'storm', 'flake', 'man']),
'shoe': set(['lace', 'string', 'maker']),
'lock': set(['down', 'up', 'smith']),
'crack': set(['down', 'up',]),
'arm': set(['chair']),
'high': set(['chair']),
'over': set(['charge']),
'under': set(['charge']),
}
But much longer, of course. I also made the corresponding dictionary the other way around:
all_stems = {
'chair': set(['high', 'arm']),
'charge': set(['over', 'under']),
'up': set(['lock', 'crack', 'vote']),
'down': set(['lock', 'crack', 'fall']),
'smith': set(['lock']),
'place': set(['birth']),
'day': set(['birth']),
'mark': set(['birth']),
'plow': set(['snow']),
'storm': set(['snow']),
'flake': set(['snow']),
'man': set(['snow']),
'lace': set(['shoe']),
'string': set(['shoe']),
'maker': set(['shoe']),
}
I've now tried to come up with an algorithm to find any match of two or more "stems" that match two or more "endings". Above, for example, it would match down and up with lock and crack, resulting in
lockdown
lockup
crackdown
crackup
But not including 'upvote', 'downfall' or 'locksmith' (and it's this that causes me the biggest problems). I get false positives like:
pancake
cupcake
cupboard
But I'm just going round in "loops". (Pun intended) and I don't seem to get anywhere. I'd appreciate any kick in the right direction.
Confused and useless code so far, which you probably should just ignore:
findings = defaultdict(set)
for stem, endings in all_endings.items():
    # What stems have matching endings:
    for ending in endings:
        otherstems = all_stems[ending]
        if not otherstems:
            continue
        for otherstem in otherstems:
            # Find endings that also exist for other stems
            otherendings = all_endings[otherstem].intersection(endings)
            if otherendings:
                # Some kind of match
                findings[stem].add(otherstem)
# Go through this in order of what is the most stems that match:
MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
    for this_stem in match:
        other_stems = set() # Stems that have endings in common with this_stem
        other_endings = set() # Endings this stem have in common with other stems
        this_endings = all_endings[this_stem]
        for this_ending in this_endings:
            for other_stem in all_stems[this_ending] - set([this_stem]):
                matching_endings = this_endings.intersection(all_endings[other_stem])
                if matching_endings:
                    other_endings.add(this_ending)
                    other_stems.add(other_stem)
        stem_matches = all_stems[other_endings.pop()]
        for other in other_endings:
            stem_matches = stem_matches.intersection(all_stems[other])
        if len(stem_matches) >= MINMATCH:
            for m in stem_matches:
                for e in all_endings[m]:
                    print(m+e)

It's not particularly pretty, but this is quite straightforward if you break your dictionary down into two lists, and use explicit indices:
all_stems = {
'chair' : set(['high', 'arm']),
'charge': set(['over', 'under']),
'fall' : set(['down', 'water', 'night']),
'up' : set(['lock', 'crack', 'vote']),
'down' : set(['lock', 'crack', 'fall']),
}
endings = all_stems.keys()
stem_sets = all_stems.values()
i = 0
for target_stem_set in stem_sets:
    i += 1
    j = 0
    remaining_stems = stem_sets[i:]
    for remaining_stem_set in remaining_stems:
        j += 1
        union = target_stem_set & remaining_stem_set
        if len(union) > 1:
            print "%d matches found" % len(union)
            for stem in union:
                print "%s%s" % (stem, endings[i-1])
                print "%s%s" % (stem, endings[j+i-1])
Output:
$ python stems_and_endings.py
2 matches found
lockdown
lockup
crackdown
crackup
Basically all we're doing is iterating through each set in turn and comparing it with every remaining set to see if they share at least two stems (the variable is named union, but & computes the intersection). We never have to try sets that fall earlier than the current set, because they've already been compared in a prior iteration. The rest (indexing, etc.) is just book-keeping.
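The same pairwise comparison can be sketched in Python 3 with itertools.combinations, which removes the manual index book-keeping. This is an illustrative variant, not the answer's code: shared_compounds is a hypothetical name, the dictionary is a shortened copy of all_stems, and like the code above it only compares endings two at a time.

```python
from itertools import combinations

# Shortened copy of all_stems from the question: ending -> stems.
all_stems = {
    'chair': {'high', 'arm'},
    'charge': {'over', 'under'},
    'up': {'lock', 'crack', 'vote'},
    'down': {'lock', 'crack', 'fall'},
    'smith': {'lock'},
}


def shared_compounds(stems_by_ending, min_match=2):
    """Return compounds for every pair of endings sharing >= min_match stems."""
    found = set()
    pairs = combinations(stems_by_ending.items(), 2)
    for (ending_a, stems_a), (ending_b, stems_b) in pairs:
        common = stems_a & stems_b  # set intersection
        if len(common) >= min_match:
            for stem in common:
                found.add(stem + ending_a)
                found.add(stem + ending_b)
    return sorted(found)


print(shared_compounds(all_stems))
# ['crackdown', 'crackup', 'lockdown', 'lockup']
```

'locksmith' is left out because smith shares only one stem with any other ending, and 'vote' and 'fall' are dropped because they are not in the intersection.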

I think the way to avoid those false positives is to remove candidates that have no words in the intersection of stems, if that makes sense.
Please have a look and let me know if I am missing something.
#using all_stems and all_endings from the question
#this function is declared at the end of this answer
two_or_more_stem_combinations = get_stem_combinations(all_stems)
print "two_or_more_stem_combinations", two_or_more_stem_combinations
#this print shows ... [set(['lock', 'crack'])]
for request in two_or_more_stem_combinations:
    #we filter the initial index to only look for sets or words in the request
    candidates = filter(lambda x: x[0] in request, all_endings.items())
    #intersection of the words for the request
    words = candidates[0][1]
    for c in candidates[1:]:
        words=words.intersection(c[1])
    #it's handy to have it in a dict
    candidates = dict(candidates)
    #we need to remove those that do not contain
    #any words after the intersection of stems of all the candidates
    candidates_to_remove = set()
    for c in candidates.items():
        if len(c[1].intersection(words)) == 0:
            candidates_to_remove.add(c[0])
    for key in candidates_to_remove:
        del candidates[key]
    #now we know what to combine
    for c in candidates.keys():
        print "combine", c , "with", words
Output :
combine lock with set(['down', 'up'])
combine crack with set(['down', 'up'])
As you can see this solution doesn't contain those false positives.
Edit: complexity
The complexity of this solution doesn't get worse than O(3n) in the worst case, not counting dictionary accesses. And for most executions the first filter narrows down the solution space quite a lot.
Edit: getting the stems
This function basically explores recursively the dictionary all_stems and finds the combinations of two or more endings for which two or more stems coincide.
def get_stems_recursive(stems,partial,result,at_least=2):
    if len(partial) >= at_least:
        stem_intersect=all_stems[partial[0]]
        for x in partial[1:]:
            stem_intersect = stem_intersect.intersection(all_stems[x])
            if len(stem_intersect) < 2:
                return
        result.append(stem_intersect)
    for i in range(len(stems)):
        remaining = stems[i+1:]
        get_stems_recursive(remaining,partial + [stems[i][0]],result)
def get_stem_combinations(all_stems,at_least=2):
    result = []
    get_stems_recursive(all_stems.items(),list(),result)
    return result
two_or_more_stem_combinations = get_stem_combinations(all_stems)

== Edited answer: ==
Well, here's another iteration for your consideration with the mistakes I made the first time addressed. Actually the result is code that is even shorter and simpler. The doc for combinations says that "if the input elements are unique, there will be no repeat values in each combination", so it should only be forming and testing the minimum number of intersections. It also appears that determining endings_by_stems isn't necessary.
from itertools import combinations
MINMATCH = 2
print 'all words with at least', MINMATCH, 'endings in common:'
for (word0,word1) in combinations(stems_by_endings, 2):
    ending_words0 = stems_by_endings[word0]
    ending_words1 = stems_by_endings[word1]
    common_endings = ending_words0 & ending_words1
    if len(common_endings) >= MINMATCH:
        for stem in common_endings:
            print ' ', stem+word0
            print ' ', stem+word1
# all words with at least 2 endings in common:
# lockdown
# lockup
# falldown
# fallup
# crackdown
# crackup
== Previous answer ==
I haven't attempted much optimizing, but here's a somewhat brute-force (but short) approach that first calculates 'ending_sets' for each stem word, and then finds all the stem words that have common ending_sets with at least the specified minimum number of common endings.
In the final phase it prints out all the possible combinations of these stem + ending words it has detected that meet the criteria. I tried to make all variable names as descriptive as possible to make it easy to follow. ;-) I've also left out the definitions of 'all_endings' and 'all_stems'.
from collections import defaultdict
from itertools import combinations
ending_sets = defaultdict(set)
for stem in all_stems:
    # create a set of all endings that have this as stem
    for ending in all_endings:
        if stem in all_endings[ending]:
            ending_sets[stem].add(ending)
MINMATCH = 2
print 'all words with at least', MINMATCH, 'endings in common:'
for (word0,word1) in combinations(ending_sets, 2):
    ending_words0 = ending_sets[word0]
    ending_words1 = ending_sets[word1]
    if len(ending_words0) >= MINMATCH and ending_words0 == ending_words1:
        for stem in ending_words0:
            print ' ', stem+word0
            print ' ', stem+word1
# output
# all words with at least 2 endings in common:
# lockup
# lockdown
# crackup
# crackdown

If you represent your stemming relationships in a binary array (where 1 means "x can follow y", for instance, and all other elements are set to 0), what you are trying to do is equivalent to looking for "broken rectangles" filled with ones:
        ...  lock  **0  crack  **1  ...
...     ...
down    ...   1     0     1     1
up      ...   1     1     1     1
...     ...
Here, lock, crack, and **1 (example word) can be matched with down and up (but not word **0). The stemming relationships draw a 2x3 rectangle filled with ones.
Hope this helps!
