How to optimize comparing combinations in a python list - python

I have a list of 16,000 lists. Each sublist contains a tuple and a score, like this:
mylist = [
    [('only','one','time'), 10.54],
    [('one','time'), 3.21],
    [('red','hot','chili','peppers'), 0.223],
    [('red','hot','chili'), 1.98],
]
My goal is to iterate through combinations in mylist and, whenever a superset or subset is detected, remove whichever of the two elements has the lower score. So in this example, I want to remove

    [('one','time'), 3.21],
    [('red','hot','chili','peppers'), 0.223]

because ('one','time') is a subset of ('only','one','time') and, between the two, ('one','time') has the lower score (10.54 > 3.21); and ('red','hot','chili','peppers') is a superset of ('red','hot','chili') and, between the two, 0.223 < 1.98.
My initial solution was brute force: get every pairwise combination of the list (n choose 2), compare the tuples for subsets using the all() function, then drop the item with the min() score.
This performs poorly due to the number of combinations to search:

from itertools import combinations
from operator import itemgetter

removelist = []
for x, y in combinations(mylist, 2):
    if all(word in x[0] for word in y[0]) or all(word in y[0] for word in x[0]):
        smallest = min([x, y], key=itemgetter(1))
        if smallest not in removelist:
            removelist.append(smallest)
outlist = [x for x in mylist if x not in removelist]
returns:
outlist = [
    [('only','one','time'), 10.54],
    [('red','hot','chili'), 1.98],
]
So for a list of ~16,000 sublists, this would be roughly:

combinations = n! / (r! * (n-r)!)
             = 16,000! / (2! * 15,998!)
             = 16,000 * 15,999 / 2
             = 127,992,000
Is there a smarter way to do this, reducing the 127 million items I need to check?
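(As a sanity check, the arithmetic above matches math.comb directly:)

```python
from math import comb

n, r = 16_000, 2
print(comb(n, r))  # 127992000
```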

This might be a thousand times faster than yours. First I convert the word tuples to sets for simpler and faster subset checks, like @Alexander. Then I sort by set size, so I don't have to check for supersets. (Because if |A| ≤ |B|, the only way A can be a superset of B is if A equals B, in which case it is also a subset of B.)
And then comes my main trick. Let's say we have the word set {'red','hot','chili'}, and we want to find the word sets of which it is a subset. Do we need to check all other (larger or equal-size) sets? No. It suffices to check only those sets that contain the word 'red'. Or only those with 'hot'. Or only those with 'chili'. Let's take the rarest word, i.e., the one in the fewest sets (in this case I'd guess 'chili').
I decided to call your lists "songs"; it makes them nicer to talk about.
from collections import defaultdict

def process_list_stefan(mylist):
    # Change to sets, attach the index, and sort by number of words (many to few)
    songs = [(i, set(words), score) for i, (words, score) in enumerate(mylist)]
    songs.sort(key=lambda song: -len(song[1]))

    # Check songs against others, identify the ones to remove
    remove = set()
    songs_with_word = defaultdict(list)
    for song in songs:
        i, words1, score1 = song
        # Pick the song's rarest word
        word = min(words1, key=lambda word: len(songs_with_word[word]))
        # Go through songs containing that word
        for j, words2, score2 in songs_with_word[word]:
            if words1 <= words2:
                # Lower score loses. In case of tie, lower index loses.
                remove.add(min((score1, i), (score2, j))[1])
        # Make this song available as superset candidate
        for word in words1:
            songs_with_word[word].append(song)

    # Apply the removals
    return [song for i, song in enumerate(mylist) if i not in remove]
Update: Actually, instead of just using the song's rarest word and going through all its "supersets" (sets containing that word), consider all words in the current song and use the intersection of their "supersets". In my testing with made-up data, it's faster still, by about a factor of 1.6:
from collections import defaultdict

def process_list_stefan(mylist):
    # Change to sets, attach the index, and sort by number of words (many to few)
    songs = [(i, set(words), score) for i, (words, score) in enumerate(mylist)]
    songs.sort(key=lambda song: -len(song[1]))

    # Helper: Intersection of sets
    def intersect(sets):
        s = next(sets).copy()
        for t in sets:
            s &= t
        return s

    # Check songs against others, identify the ones to remove
    remove = set()
    songs_with_word = defaultdict(set)
    for song in songs:
        i, words1, score1 = song
        # Indices of all songs whose word set contains every word of this song
        for j in intersect(songs_with_word[word] for word in words1):
            # Lower score loses. In case of tie, lower index loses.
            remove.add(min((score1, i), (mylist[j][1], j))[1])
        # Make this song available as superset candidate
        for word in words1:
            songs_with_word[word].add(i)

    # Apply the removals
    return [song for i, song in enumerate(mylist) if i not in remove]

First, create a new list that retains the original scores but converts the tuples of words into sets for faster comparisons and set membership testing.
Enumerate through each set of words and scores in this new list and compare against all remaining sets and scores. Using sets, we can detect subsets via s1.issubset(s2) and supersets via s1.issuperset(s2).
Once the subset/superset has been detected, we compare scores. If the current record has a higher score, we mark the other for removal and then continue comparing against the remaining records. Otherwise, we add the current index location to a set of indices to be subsequently removed and continue any remaining comparisons against this record.
Once we have processed all of the records, we use a conditional list comprehension to create a new list of all records to keep.
In terms of subset comparisons, the worst case is n(n-1)/2 comparisons, which is still O(n^2). Of course, each subset comparison has its own cost based on the number of unique words in each sublist. This solution thus makes the same number of comparisons as the OP's for x, y in combinations(mylist, 2) method, but the subset/superset comparisons are done using sets rather than lists. As a result, this method should still be significantly faster.
def process_list(my_list):
    # Convert tuples to sets.
    my_sets = [(set(tuples), score) for tuples, score in my_list]
    idx_to_remove = set()
    for i, (subset1, score1) in enumerate(my_sets):
        for j, (subset2, score2) in enumerate(my_sets[(i + 1):], start=i + 1):
            if subset1.issubset(subset2) or subset1.issuperset(subset2):
                # Subset/Superset detected.
                idx_to_remove.add(i if score1 < score2 else j)
    # Remove filtered items from list and return filtered list.
    return [tup for n, tup in enumerate(my_list) if n not in idx_to_remove]
# TEST CASES
# Case 1.
mylist = [
    [('only','one','time'), 10.54],
    [('one','time'), 3.21],
    [('red','hot','chili','peppers'), 0.223],
    [('red','hot','chili'), 1.98],
]
>>> process_list(mylist)
[[('only', 'one', 'time'), 10.54], [('red', 'hot', 'chili'), 1.98]]
# Case 2.
# ('a', 'b', 'd') is superset of ('a', 'b') and has a lower score, so remove former.
# ('a', 'b') is a subset of ('a', 'b', 'c') and has a lower score, so remove former.
mylist = [[('a', 'b', 'c'), 3], [('a', 'b', 'd'), 1], [('a', 'b'), 2]]
>>> process_list(mylist)
[[('a', 'b', 'c'), 3]]
# Case 3. Same items as Case 2, but different order. Same logic as Case 2.
mylist = [[('a', 'b'), 2], [('a', 'b', 'c'), 3], [('a', 'b', 'd'), 1]]
>>> process_list(mylist)
[[('a', 'b', 'c'), 3]]
# Case 4.
# ('a', 'b', 'c') is a superset of ('a', 'b') and has a lower score, so remove former.
# ('d','c') is a subset of ('d','c','w') and has a lower score, so remove former.
mylist = [[('a', 'b'), 2], [('a', 'b', 'c'), 1], [('d','c','w'), 4], [('d','c'), 2]]
>>> process_list(mylist)
[[('a', 'b'), 2], [('d', 'c', 'w'), 4]]

Related

How to use enumerate in a list comprehension with two lists?

I just started to use list comprehension and I'm struggling with it. In this case, I need to get the index n of each list (sequence_0 and sequence_1) that the iteration is at each time. How can I do that?
The idea is to get the longest sequence of equal nucleotides (a motif) between the two sequences. Once a pair is found, the program should continue to the next nucleotides of the sequences, checking if they are also equal and then elongating the motif with them. The final output should be a list of all the motifs found.
The problem is that, to continue to the next nucleotides once a pair is found, I need the position of the pair in both sequences. The index function does not work in this case, and that's why I need enumerate.
Also, I don't understand exactly the reason for the x and y between (); it would be good to understand that too :)
Just to explain, the content of the lists is DNA sequences, so it's basically something like:
sequence_1 = ['A', 'T', 'C', 'A', 'C']
def find_shared_motif(arq):
    data = fastaread(arq)
    seqs = [list(sequence) for sequence in data.values()]
    motifs = [[]]
    i = 0
    sequence_0, sequence_1 = seqs[0], seqs[1]  # just to simplify
    for x, y in [(x, y) for x in zip(sequence_0[::], sequence_0[1::]) for y in zip(sequence_1[::], sequence_1[1::])]:
        print(f'Pairs {"".join(x)} and {"".join(y)} being analyzed...')
        if x == y:
            print(f'Pairs {"".join(x)} and {"".join(y)} match!')
            motifs[i].append(x[0]), motifs[i].append(x[1])
            k = sequence_0.index(x[0]) + 2  # NOT RETURNING THE RIGHT NUMBER
            u = sequence_1.index(y[0]) + 2
            print(k, u)
            # Determines if the rest of the sequence is compatible
            print(f'Starting to elongate the motif {x}...')
            for j, m in enumerate(sequence_1[u::]):
                try:
                    # Checks if the nucleotide is equal for both of the sequences
                    print(f'Analyzing the pair {sequence_0[k + j]}, {m}')
                    if m == sequence_0[k + j]:
                        motifs[i].append(m)
                        print(f'The pair {sequence_0[k + j]}, {m} is equal!')
                    # Stop at the first nonequal residue
                    else:
                        print(f'The pair {sequence_0[k + j]}, {m} is not equal.')
                        break
                except IndexError:
                    print('IndexError, end of the string')
        else:
            i += 1
            motifs.append([])
    return motifs
...
One way to go with it is to start zipping both lists:
a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
c = list(zip(a,b))
In that case, c will have the list of tuples below
c = [('A','A'), ('T','T'), ('C','C'), ('A','C'), ('C','T')]
Then, you can go with list comprehension and enumerate:
d = [(i, t) for i, t in enumerate(c)]
This will bring something like this to you:
d = [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Of course you can go for a one-liner, if you want:
d = [(i, t) for i, t in enumerate(zip(a,b))]
>>> [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Now, you have to deal with the nested tuples. Focus on the internal ones. It is obvious that what you want is to compare the first element of the tuples with the second ones. But, also, you will need the position where the difference resides (that lies outside). So, let's build a function for it. Inside the function, i will capture the positions, and t will capture the inner tuples:
def compare(a, b):
    d = [(i, t) for i, t in enumerate(zip(a, b))]
    for i, t in d:
        if t[0] != t[1]:
            return i
    return -1
In that way, if you get -1 at the end, it means that all elements in both lists are equal, side by side. Otherwise, you will get the position of the first difference between them.
It is important to notice that, in the case of two lists with different sizes, the zip function will bring a list of tuples with the size matching the smaller of the lists. The extra elements of the other list will be ignored.
Ex.
list(zip([1,2], [3,4,5]))
>>> [(1,3), (2,4)]
You can use the function compare with your code to get the positions where the lists differ, and use that to build your motifs.
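For instance, on the example sequences from earlier (restating compare so the snippet runs on its own):

```python
def compare(a, b):
    d = [(i, t) for i, t in enumerate(zip(a, b))]
    for i, t in d:
        if t[0] != t[1]:
            return i
    return -1

a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
print(compare(a, b))  # 3 -- first position where the sequences differ
print(compare(a, a))  # -1 -- no differences
```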

Reducing compute time for Anagram word search

The code below is a brute force method of searching a list of words and creating sub-lists of any that are anagrams.
Searching the entire English dictionary is prohibitively time consuming, so I'm curious if anyone has tips for reducing the computational complexity of the code?
def anogramtastic(anagrms):
    d = []
    e = []
    for j in range(len(anagrms)):
        if anagrms[j] in e:
            pass
        else:
            templist = []
            tester = anagrms[j]
            tester = list(tester)
            tester.sort()
            tester = ''.join(tester)
            for k in range(len(anagrms)):
                if k == j:
                    pass
                else:
                    testers = anagrms[k]
                    testers = list(testers)
                    testers.sort()
                    testers = ''.join(testers)
                    if testers == tester:
                        templist.append(anagrms[k])
                        e.append(anagrms[k])
            if len(templist) > 0:
                templist.append(anagrms[j])
                d.append(templist)
    d.sort(key=len, reverse=True)
    return d

print(anogramtastic(wordlist))
How about using a dictionary keyed by frozensets? Frozensets are immutable and hashable, so lookups take constant time on average. And when it comes to anagrams, what makes two words anagrams of each other is that they have the same letters with the same counts. So you can construct a frozenset of {(letter, count), ...} pairs and hash it for efficient lookup.
Here's a quick little function to convert a word to a multiset using collections.Counter:
from collections import Counter, defaultdict

def word2multiset(word):
    return frozenset(Counter(word).items())
Now, given a list of words, populate your anagram dictionary like this:
list_of_words = [...]
anagram_dict = defaultdict(set)
for word in list_of_words:
    anagram_dict[word2multiset(word)].add(word)
For example, when list_of_words = ['hello', 'olleh', 'test', 'apple'], this is the output of anagram_dict after a run of the loop above:
print(anagram_dict)
defaultdict(set,
{frozenset({('e', 1), ('h', 1), ('l', 2), ('o', 1)}): {'hello',
'olleh'},
frozenset({('e', 1), ('s', 1), ('t', 2)}): {'test'},
frozenset({('a', 1), ('e', 1), ('l', 1), ('p', 2)}): {'apple'}})
Unless I'm misunderstanding the problem, simply grouping the words by sorting their characters should be an efficient solution -- as you've already realized. The trick is to avoid comparing every word to all the other ones. A dict with the char-sorted string as key will make finding the right group for each word fast; a lookup/insertion is O(1) on average.
#!/usr/bin/env python3
# coding=utf8
from sys import stdin

groups = {}
for line in stdin:
    w = line.strip()
    g = ''.join(sorted(w))
    if g not in groups:
        groups[g] = []
    groups[g].append(w)

for g, words in groups.items():
    if len(words) > 1:
        print('%2d %-20s' % (len(words), g), ' '.join(words))
Testing on my words file (99171 words), it seems to work well:
anagram$ wc /usr/share/dict/words
99171 99171 938848 /usr/share/dict/words
anagram$ time ./anagram.py < /usr/share/dict/words | tail
2 eeeprsw sweeper weepers
2 brsu burs rubs
2 aeegnrv avenger engrave
2 ddenoru redound rounded
3 aesy ayes easy yeas
2 gimnpu impugn umping
2 deeiinsst densities destinies
2 abinost bastion obtains
2 degilr girdle glider
2 orsttu trouts tutors
real 0m0.366s
user 0m0.357s
sys 0m0.012s
You can speed things up considerably by using a dictionary for checking membership instead of doing linear searches. The only "trick" is to devise a way to create keys for it such that it will be the same for anagrammatical words (and not for others).
In the code below this is being done by creating a sorted tuple from the letters in each word.
def anagramtastic(words):
    dct = {}
    for word in words:
        key = tuple(sorted(word))  # Identifier based on letters.
        dct.setdefault(key, []).append(word)
    # Return a list of all that had an anagram.
    return [words for words in dct.values() if len(words) > 1]

wordlist = ['act', 'cat', 'binary', 'brainy', 'case', 'aces',
            'aide', 'idea', 'earth', 'heart', 'tea', 'tee']
print('result:', anagramtastic(wordlist))
Output produced:
result: [['act', 'cat'], ['binary', 'brainy'], ['case', 'aces'], ['aide', 'idea'], ['earth', 'heart']]

Counting the number of times a letter occurs at a certain position using python

I'm a python beginner and I've come across this problem and I'm not sure how I'd go about tackling it.
If I have the following sequence/strings:
GATCCG
GTACGC
How do I count the frequency with which each letter occurs at each position? I.e., G occurs at position 1 twice across the two sequences, A occurs at position 1 zero times, etc.
Any help would be appreciated, thank you!
You can use a combination of defaultdict and enumerate like so:
from collections import defaultdict

sequences = ['GATCCG', 'GTACGC']
d = defaultdict(lambda: defaultdict(int))  # d[char][position] = count
for seq in sequences:
    for i, char in enumerate(seq):  # enum('abc'): [(0,'a'), (1,'b'), (2,'c')]
        d[char][i] += 1

d['C'][3]  # 2
d['C'][4]  # 1
d['C'][5]  # 1
This builds a nested defaultdict that takes the character as first and the position as second key and provides the count of occurrences of said character in said position.
If you want lists of position-counts:
max_len = max(map(len, sequences))
d = defaultdict(lambda: [0] * max_len)  # d[char] = [pos0, pos1, ...]
for seq in sequences:
    for i, char in enumerate(seq):
        d[char][i] += 1

d['G']  # [2, 0, 0, 0, 1, 1]
Not sure this is the best way, but you can use zip to do a sort of transpose on the strings, producing tuples of the letters in each position, e.g.:

x = 'GATCCG'
y = 'GTACGC'
zipped = list(zip(x, y))
print(zipped)

will produce as output:

[('G', 'G'), ('A', 'T'), ('T', 'A'), ('C', 'C'), ('C', 'G'), ('G', 'C')]
You can see from the tuples that the first positions of the two strings contain two Gs, the second positions contain an A and a T, etc. Then you could use Counter (or some other method) to get at what you want.
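For instance, one way to finish the job with Counter -- a sketch; the per-position list of Counters is just one possible shape:

```python
from collections import Counter

x = 'GATCCG'
y = 'GTACGC'

# One Counter per position: counts[i][letter] is how often `letter`
# appears at position i across the two sequences
counts = [Counter(pair) for pair in zip(x, y)]
print(counts[0]['G'])  # 2 -- both sequences have G at position 0
print(counts[1]['A'])  # 1 -- only the first sequence has A at position 1
```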

How to store multiple numbers and then add to each individual one

I have written a for loop which gives me the value of each letter's place in the alphabet.
For example, the word hello will give me the numbers 8, 5, 12, 12 and 14. Now I want to add them to another word of the same length, e.g. abcde, which would be 1, 2, 3, 4 and 5. I want to add the two sets of numbers together while keeping the individual pairs, for example 8+1, 5+2, 12+3, 12+4 and 14+5.
This is the code I have so far
for letter in message:
    if letter.isalpha() == True:
        x = alphabet.find(letter)

for letter in newkeyword:
    if letter.isalpha() == True:
        y = alphabet.find(letter)
When I try adding x and y, I get a single number. Can someone help?
If you are planning to do further calculations with the numbers, consider this solution, which creates a list of tuples (also by using zip, as @Kashyap Maduri suggested):
messages = zip(message, newkeyword)
positions = [(alphabet.find(m), alphabet.find(n)) for m, n in messages]
sums = [(a, b, a + b, "{}+{}".format(a,b)) for a, b in positions]
Each tuple in the sums list consists of both operands, their sum and a string representation of the addition.
Then you could for example print them sorted by their sum:
for a, b, sum_ab, sum_as_str in sorted(sums, key=lambda x: x[2]):
    print(sum_as_str)
Edit
when i run the program i want it to give me the answer of those sums for e.g 14+5=19 i just want the 19 part any ideas? – Shahzaib Shuz Bari
This makes it a lot easier:
messages = zip(message, newkeyword)
sums = [alphabet.find(m) + alphabet.find(n) for m, n in messages]
And you get a list of all the sums.
You are looking for the zip function. It zips two or more iterables together. For example:
l1 = 'abc'
l2 = 'def'
zip(l1, l2)
# [('a', 'd'), ('b', 'e'), ('c', 'f')] in python 2.7
and
list(zip(l1, l2))
# [('a', 'd'), ('b', 'e'), ('c', 'f')] in python 3
So here is a solution for your problem:
l = list(zip(message, newkeyword))
[str(alphabet.find(x)) + '+' + str(alphabet.find(y)) for x, y in l]

all combination of a complicated list

I want to find all possible combinations of the following list:
data = ['a','b','c','d']
I know it looks like a straightforward task, and it can be achieved by something like the following code:
from itertools import combinations
comb = [c for i in range(1, len(data)+1) for c in combinations(data, i)]
But what I actually want is a way to give each element of the list data two possibilities ('a' or '-a').
An example of the combinations can be ['a','b'], ['-a','b'], ['a','b','-c'], etc.
Without, of course, cases like ['-a','a'].
You could write a generator function that takes a sequence and yields each possible combination of negations. Like this:
import itertools

def negations(seq):
    for prefixes in itertools.product(["", "-"], repeat=len(seq)):
        yield [prefix + value for prefix, value in zip(prefixes, seq)]

print(list(negations(["a", "b", "c"])))
Result (whitespace modified for clarity):
[
[ 'a', 'b', 'c'],
[ 'a', 'b', '-c'],
[ 'a', '-b', 'c'],
[ 'a', '-b', '-c'],
['-a', 'b', 'c'],
['-a', 'b', '-c'],
['-a', '-b', 'c'],
['-a', '-b', '-c']
]
You can integrate this into your existing code with something like
comb = [x for i in range(1, len(data)+1) for c in combinations(data, i) for x in negations(c)]
Once you have the regular combinations generated, you can do a second pass to generate the ones with "negation." I'd think of it like a binary number, with the number of elements in your list being the number of bits. Count from 0b0000 to 0b1111 via 0b0001, 0b0010, etc., and wherever a bit is set, negate that element in the result. This will produce 2^n combinations for each input combination of length n.
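A minimal sketch of that counting idea (the helper name negate_combo and the driver loop are mine, not from the answer):

```python
from itertools import combinations

def negate_combo(combo):
    """Yield every signed variant of one combination via a bitmask."""
    for mask in range(2 ** len(combo)):  # e.g. 0b00 .. 0b11 for two elements
        # Wherever a bit is set, negate the element at that index
        yield tuple('-' + el if mask & (1 << i) else el
                    for i, el in enumerate(combo))

data = ['a', 'b', 'c', 'd']
result = [signed
          for r in range(1, len(data) + 1)
          for combo in combinations(data, r)
          for signed in negate_combo(combo)]
print(len(result))  # 80 signed combinations for this 4-element list
```

Each input combination of length n yields 2^n signed variants, so the total here is 4·2 + 6·4 + 4·8 + 1·16 = 80.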
Here is a one-liner, but it can be hard to follow:
from itertools import product
comb = [sum(t, []) for t in product(*[([x], ['-' + x], []) for x in data])]
First, map each element of data to the lists it can become in a result: [x], ['-' + x], or [] (absent). Then take the product to get all possibilities. Finally, flatten each combination with sum.
My solution basically has the same idea as John Zwinck's answer. After you have produced the list of all combinations
comb = [c for i in range(1, len(data)+1) for c in combinations(data, i)]
you generate all possible positive/negative combinations for each element of comb. I do this by iterating through the total number of sign combinations, 2**N, treating each as a binary number where each binary digit stands for the sign of one element. (E.g. a two-element list has 4 possible sign combinations, 0 to 3, represented by 0b00 => (+,+), 0b01 => (-,+), 0b10 => (+,-) and 0b11 => (-,-).)
def twocombinations(it):
    sign = lambda c, i: "-" if c & 2**i else ""
    l = list(it)
    if len(l) < 1:
        return
    # for each possible combination of signs, make a tuple with the
    # appropriate sign before each element
    for c in range(2**len(l)):
        yield tuple(sign(c, i) + el for i, el in enumerate(l))
Now we apply this function to every element of comb and flatten the resulting nested iterator:
l = itertools.chain.from_iterable(map(twocombinations, comb))
