Compressing multiple nested `for` loops - python

Similar to this and many other questions, I have many nested loops (up to 16) of the same structure.
Problem: I have 4-letter alphabet and want to get all possible words of length 16. I need to filter those words. These are DNA sequences (hence 4 letter: ATGC), filtering rules are quite simple:
no XXXX substrings (i.e. can't have same letter in a row more than 3 times, ATGCATGGGGCTA is "bad")
specific GC content, that is number of Gs + number of Cs should be in specific range (40-50%). ATATATATATATA and GCGCGCGCGCGC are bad words
itertools.product will work for that, but data structure here gonna be giant (4^16 = 4*10^9 words)
More importantly, if I do use product, then I still have to go through each element to filter it out. Thus I will have 4 billion steps times 2
My current solution is nested for loops
alphabet = ['a','t','g','c']
for p1 in alphabet:
for p2 in alphabet:
for p3 in alphabet:
...skip...
for p16 in alphabet:
word = p1+p2+p3+...+p16
if word_is_good(word):
good_words.append(word)
counter+=1
Is there good pattern to program that without 16 nested loops? Is there a way to parallelize it efficiently (on multi-core or multiple EC2 nodes)
Also with that pattern i can plug word_is_good? check inside middle of the loops: word that starts badly is bad
...skip...
for p3 in alphabet:
word_3 = p1+p2+p3
if not word_is_good(word_3):
break
for p4 in alphabet:
...skip...

from itertools import product, islice
from time import time
length = 16
def generate(start, alphabet):
"""
A recursive generator function which works like itertools.product
but restricts the alphabet as it goes based on the letters accumulated so far.
"""
if len(start) == length:
yield start
return
gcs = start.count('g') + start.count('c')
if gcs >= length * 0.5:
alphabet = 'at'
# consider the maximum number of Gs and Cs we can have in the end
# if we add one more A/T now
elif length - len(start) - 1 + gcs < length * 0.4:
alphabet = 'gc'
for c in alphabet:
if start.endswith(c * 3):
continue
for string in generate(start + c, alphabet):
yield string
def brute_force():
""" Straightforward method for comparison """
lower = length * 0.4
upper = length * 0.5
for s in product('atgc', repeat=length):
if lower <= s.count('g') + s.count('c') <= upper:
s = ''.join(s)
if not ('aaaa' in s or
'tttt' in s or
'cccc' in s or
'gggg' in s):
yield s
def main():
funcs = [
lambda: generate('', 'atgc'),
brute_force
]
# Testing performance
for func in funcs:
# This needs to be big to get an accurate measure,
# otherwise `brute_force` seems slower than it really is.
# This is probably because of how `itertools.product`
# is implemented.
count = 100000000
start = time()
for _ in islice(func(), count):
pass
print(time() - start)
# Testing correctness
global length
length = 12
for x, y in zip(*[func() for func in funcs]):
assert x == y, (x, y)
main()
On my machine, generate was just a bit faster than brute_force, at about 390 seconds vs 425. This was pretty much as fast as I could make them. I think the full thing would take about 2 hours. Of course, actually processing them will take much longer. The problem is that your constraints don't reduce the full set much.
Here's an example of how to use this in parallel across 16 processes:
from multiprocessing.pool import Pool
alpha = 'atgc'
def generate_worker(start):
start = ''.join(start)
for s in generate(start, alpha):
print(s)
Pool(16).map(generate_worker, product(alpha, repeat=2))

Since you happen to have an alphabet of length 4 (or any "power of 2 integer"), the idea of using and integer ID and bit-wise operations comes to mind instead of checking for consecutive characters in strings. We can assign an integer value to each of the characters in alphabet, for simplicity lets use the index corresponding to each letter.
Example:
6546354310 = 33212321033134 = 'aaaddcbcdcbaddbd'
The following function converts from a base 10 integer to a word using alphabet.
def id_to_word(word_id, word_len):
word = ''
while word_id:
rem = word_id & 0x3 # 2 bits pet letter
word = ALPHABET[rem] + word
word_id >>= 2 # Bit shift to the next letter
return '{2:{0}>{1}}'.format(ALPHABET[0], word_len, word)
Now for a function to check whether a word is "good" based on its integer ID. The following method is of a similar format to id_to_word, except a counter is used to keep track of consecutive characters. The function will return False if the maximum number of identical consecutive characters is exceeded, otherwise it returns True.
def check_word(word_id, max_consecutive):
consecutive = 0
previous = None
while word_id:
rem = word_id & 0x3
if rem != previous:
consecutive = 0
consecutive += 1
if consecutive == max_consecutive + 1:
return False
word_id >>= 2
previous = rem
return True
We're effectively thinking of each word as an integer with base 4. If the Alphabet length was not a "power of 2" value, then modulo % alpha_len and integer division // alpha_len could be used in place of & log2(alpha_len) and >> log2(alpha_len) respectively, although it would take much longer.
Finally, finding all the good words for a given word_len. The advantage of using a range of integer values is that you can reduce the number of for-loops in your code from word_len to 2, albeit the outer loop is very large. This may allow for more friendly multiprocessing of your good word finding task. I have also added in a quick calculation to determine the smallest and largest IDs corresponding to good words, which helps significantly narrow down the search for good words
ALPHABET = ('a', 'b', 'c', 'd')
def find_good_words(word_len):
max_consecutive = 3
alpha_len = len(ALPHABET)
# Determine the words corresponding to the smallest and largest ids
smallest_word = '' # aaabaaabaaabaaab
largest_word = '' # dddcdddcdddcdddc
for i in range(word_len):
if (i + 1) % (max_consecutive + 1):
smallest_word = ALPHABET[0] + smallest_word
largest_word = ALPHABET[-1] + largest_word
else:
smallest_word = ALPHABET[1] + smallest_word
largest_word = ALPHABET[-2] + largest_word
# Determine the integer ids of said words
trans_table = str.maketrans({c: str(i) for i, c in enumerate(ALPHABET)})
smallest_id = int(smallest_word.translate(trans_table), alpha_len) # 1077952576
largest_id = int(largest_word.translate(trans_table), alpha_len) # 3217014720
# Find and store the id's of "good" words
counter = 0
goodies = []
for i in range(smallest_id, largest_id + 1):
if check_word(i, max_consecutive):
goodies.append(i)
counter += 1
In this loop I have specifically stored the word's ID as opposed to the actual word itself incase you are going to use the words for further processing. However, if you are just after the words then change the second to last line to read goodies.append(id_to_word(i, word_len)).
NOTE: I receive a MemoryError when attempting to store all good IDs for word_len >= 14. I suggest writing these IDs/words to a file of some sort!

Related

Find Longest Alphabetically Ordered Substring - Efficiently

The goal of some a piece of code I wrote is to find the longest alphabetically ordered substring within a string.
"""
Find longest alphabetically ordered substring in string s.
"""
s = 'zabcabcd' # Test string.
alphabetical_str, temp_str = s[0], s[0]
for i in range(len(s) - 1): # Loop through string.
if s[i] <= s[i + 1]: # Check if next character is alphabetically next.
temp_str += s[i + 1] # Add character to temporary string.
if len(temp_str) > len(alphabetical_str): # Check is temporary string is the longest string.
alphabetical_str = temp_str # Assign longest string.
else:
temp_str = s[i + 1] # Assign last checked character to temporary string.
print(alphabetical_str)
I get an output of abcd.
But the instructor says there is PEP 8 compliant way of writing this code that is 7-8 lines of code and there is a more computational efficient way of writing this code that is ~16 lines. Also that there is a way of writing this code in only 1 line 75 character!
Can anyone provide some insight on what the code would look like if it was 7-8 lines or what the most work appropriate way of writing this code would be? Also any PEP 8 compliance critique would be appreciated.
Linear time:
s = 'zabcabcd'
longest = current = []
for c in s:
if [c] < current[-1:]:
current = []
current += c
longest = max(longest, current, key=len)
print(''.join(longest))
Your PEP 8 issues I see:
"Limit all lines to a maximum of 79 characters." (link) - You have two lines longer than that.
"do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b" [...] the ''.join() form should be used instead" (link). You do that repeated string concatenation.
Also, yours crashes if the input string is empty.
1 line 72 characters:
s='zabcabcd';print(max([t:='']+[t:=t*(c>=t[-1:])+c for c in s],key=len))
Optimized linear time (I might add benchmarks tomorrow):
def Kelly_fast(s):
maxstart = maxlength = start = length = 0
prev = ''
for c in s:
if c >= prev:
length += 1
else:
if length > maxlength:
maxstart = start
maxlength = length
start += length
length = 1
prev = c
if length > maxlength:
maxstart = start
maxlength = length
return s[maxstart : maxstart+maxlength]
Depending on how you choose to count, this is only 6-7 lines and PEP 8 compliant:
def longest_alphabetical_substring(s):
sub = '', 0
for i in range(len(s)):
j = i + len(sub) + 1
while list(s[i:j]) == sorted(s[i:j]) and j <= len(s):
sub, j = s[i:j], j+1
return sub
print(longest_alphabetical_substring('zabcabcd'))
Your own code was PEP 8 compliant as far as I can tell, although it would make sense to capture code like this in a function, for easy reuse and logical grouping for improved readability.
The solution I provided here is not very efficient, as it keeps extracting copies of the best result so far. A slightly longer solution that avoids this:
def longest_alphabetical_substring(s):
n = m = 0
for i in range(len(s)):
for j in range(i+1, len(s)+1):
if j == len(s) or s[j] < s[j-1]:
if j-i > m-n:
n, m = i, j
break
return s[n:m]
print(longest_alphabetical_substring('zabcabcd'))
There may be more efficient ways of doing this; for example you could detect that there's no need to keep looking because there is not enough room left in the string to find longer strings, and exit the outer loop sooner.
User #kellybundy is correct, a truly efficient solution would be linear in time. Something like:
def las_efficient(s):
t = s[0]
return max([(t := c) if c < t[-1] else (t := t + c) for c in s[1:]], key=len)
print(las_efficient('zabcabcd'))
No points for readability here, but PEP 8 otherwise, and very brief.
And for an even more efficient solution:
def las_very_efficient(s):
m, lm, t, ls = '', 0, s[0], len(s)
for n, c in enumerate(s[1:]):
if c < t[-1]:
t = c
else:
t += c
if len(t) > lm:
m, lm = t, len(t)
if n + lm > ls:
break
return m
You can keep appending characters from the input string to a candidate list, but clear the list when the current character is lexicographically smaller than the last character in the list, and set the candidate list as the output list if it's longer than the current output list. Join the list into a string for the final output:
s = 'zabcabcdabc'
candidate = longest = []
for c in s:
if candidate and c < candidate[-1]:
candidate = []
candidate.append(c)
if len(candidate) > len(longest):
longest = candidate
print(''.join(longest))
This outputs:
abcd

How to find the max number of times a sequence of characters repeats consecutively in a string? [duplicate]

This question already has answers here:
How to count consecutive repetitions of a substring in a string?
(4 answers)
Closed 1 year ago.
I'm working on a cs50/pset6/dna project. I'm struggling with finding a way to analyze a sequence of strings, and gather the maximum number of times a certain sequence of characters repeats consecutively. Here is an example:
String: JOKHCNHBVDBVDBVDJHGSBVDBVD
Sequence of characters I should look for: BVD
Result: My function should be able to return 3, because in one point the characters BVD repeat three times consecutively, and even though it repeats again two times, I should look for the time that it repeats the most number of times.
It's a bit lame, but one "brute-force"ish way would be to just check for the presence of the longest substring possible. As soon as a substring is found, break out of the loop:
EDIT - Using a function might be more straight forward:
def get_longest_repeating_pattern(string, pattern):
if not pattern:
return ""
for i in range(len(string)//len(pattern), 0, -1):
current_pattern = pattern * i
if current_pattern in string:
return current_pattern
return ""
string = "JOKHCNHBVDBVDBVDJHGSBVDBVD"
pattern = "BVD"
longest_repeating_pattern = get_longest_repeating_pattern(string, pattern)
print(len(longest_repeating_pattern))
EDIT - explanation:
First, just a simple for-loop that starts at a larger number and goes down to a smaller number. For example, we start at 5 and go down to 0 (but not including 0), with a step size of -1:
>>> for i in range(5, 0, -1):
print(i)
5
4
3
2
1
>>>
if string = "JOKHCNHBVDBVDBVDJHGSBVDBVD", then len(string) would be 26, if pattern = "BVD", then len(pattern) is 3.
Back to my original code:
for i in range(len(string)//len(pattern), 0, -1):
Plugging in the numbers:
for i in range(26//3, 0, -1):
26//3 is an integer division which yields 8, so this becomes:
for i in range(8, 0, -1):
So, it's a for-loop that goes from 8 to 1 (remember, it doesn't go down to 0). i takes on the new value for each iteration, first 8 , then 7, etc.
In Python, you can "multiply" strings, like so:
>>> pattern = "BVD"
>>> pattern * 1
'BVD'
>>> pattern * 2
'BVDBVD'
>>> pattern * 3
'BVDBVDBVD'
>>>
A slightly less bruteforcey solution:
string = 'JOKHCNHBVDBVDBVDJHGSBVDBVD'
key = 'BVD'
len_k = len(key)
max_l = 0
passes = 0
curr_len=0
for i in range(len(string) - len_k + 1): # split the string into substrings of same len as key
if passes > 0: # If key was found in previous sequences, pass ()this way, if key is 'BVD', we will ignore 'VD.' and 'D..'
passes-=1
continue
s = string[i:i+len_k]
if s == key:
curr_len+=1
if curr_len > max_l:
max_l=curr_len
passes = len(key)-1
if prev_s == key:
if curr_len > max_l:
max_l=curr_len
else:
curr_len=0
prev_s = s
print(max_l)
You can do that very easily, elegantly and efficiently using a regex.
We look for all sequences of at least one repetition of your search string. Then, we just need to take the maximum length of these sequences, and divide by the length of the search string.
The regex we use is '(:?<your_sequence>)+': at least one repetition (the +) of the group (<your_sequence>). The :? is just here to make the group non capturing, so that findall returns the whole match, and not just the group.
In case there is no match, we use the default parameter of the max function to return 0.
The code is very short, then:
import re
def max_consecutive_repetitions(search, data):
search_re = re.compile('(?:' + search + ')+')
return max((len(seq) for seq in search_re.findall(data)), default=0) // len(search)
Sample run:
print(max_consecutive_repetitions("BVD", "JOKHCNHBVDBVDBVDJHGSBVDBVD"))
# 3
This is my contribution, I'm not a professional but it worked for me (sorry for bad English)
results = {}
# Loops through all the STRs
for i in range(1, len(reader.fieldnames)):
STR = reader.fieldnames[i]
j = 0
s=0
pre_s = 0
# Loops through all the characters in sequence.txt
while j < (len(sequence) - len(STR)):
# checks if the character we are currently looping is the same than the first STR character
if STR[0] == sequence[j]:
# while the sub-string since j to j - STR lenght is the same than STR, I called this a streak
while sequence[j:(j + len(STR))] == STR:
# j skips to the end of sub-string
j += len(STR)
# streaks counter
s += 1
# if s > 0 means that that the whole STR and sequence coincided at least once
if s > 0:
# save the largest streak as pre_s
if s > pre_s:
pre_s = s
# restarts the streak counter to continue exploring the sequence
s=0
j += 1
# assigns pre_s value to a dictionary with the current STR as key
results[STR] = pre_s
print(results)

Next lexicographically bigger string permutation and solution efficiency

I'm trying to solve Hackerrank question: Find next lexicographically bigger string permutation for a given string input.
Here my solution:
def biggerIsGreater(w):
if len(w)<=1: return w
# pair letters in w string with int representing positional index in alphabet
letter_mapping = dict(zip(string.ascii_lowercase, range(1, len(string.ascii_lowercase)+1)))
char_ints = [letter_mapping[letter] for letter in w.lower() if letter in letter_mapping]
# reverse it
reversed_char_ints = char_ints[::-1]
# get char set to reorder, including pivot.
scanned_char_ints = []
index = 0
zipped = list(zip(reversed_char_ints, reversed_char_ints[1:]))
while index < len(zipped):
char_tuple = zipped[index]
scanned_char_ints.append(char_tuple[0])
if char_tuple[0] <= char_tuple[1]:
if index == len(zipped) - 1:
return "no answer"
else:
scanned_char_ints.append(char_tuple[1])
break
index += 1
# get smallest among bigger values of pivot
char_to_switch = None
char_to_switch_index = None
for item in scanned_char_ints[:-1]:
if item > scanned_char_ints[-1]:
if char_to_switch == None or item <= char_to_switch:
char_to_switch = item
char_to_switch_index = scanned_char_ints.index(item)
# switch pivot and smallest of bigger chars in scanned chars
pivot_index = len(scanned_char_ints) - 1
scanned_char_ints[pivot_index], scanned_char_ints[char_to_switch_index] = scanned_char_ints[char_to_switch_index], scanned_char_ints[pivot_index]
# order from second to end the other chars, so to find closest bigger number of starting number
ord_scanned_char_ints = scanned_char_ints[:-1]
ord_scanned_char_ints.sort(reverse=True)
ord_scanned_char_ints.append(scanned_char_ints[-1])
# reverse scanned chars
ord_scanned_char_ints.reverse()
# rebuild char int list
result_ints = char_ints[:len(char_ints) - len(ord_scanned_char_ints)]
result_ints.extend(ord_scanned_char_ints)
result_ = ""
for char_intx in result_ints:
for char, int_charz in letter_mapping.items():
if int_charz == char_intx:
result_ += char
return result_
(I know that there are solution on internet with more concise way of implement the problem, but I obviously trying to succeed by myself).
Now, it seems to run for 1, 2, 100 input of strings with at most 100 characters.
But when hackerrank test procedure tests it against 100000 strings of at most 100 letters, an error results, with no further information about it. Running a test with a similar input size, in my machine, does not throw any error.
What is wrong with this solution?
Thanks in advance

Find the shortest substring whose replacement makes the string contain equal number of each character

I have a string of length n composed of letters A,G,C and T. The string is steady if it contains equal number of A,G,C and T(each n/4 times). I need to find the minimum length of the substring that when replaced makes it steady. Here's a link to the full description of the problem.
Suppose s1=AAGAAGAA.
Now since n=8 ideally it should have 2 As, 2 Ts, 2 Gs and 2 Cs. It has 4 excessive As. Hence we need a substring which contains at least 4 As.
I start by taking a 4 character substring from left and if not found then I increment a variable mnum(ie look for 5 variable substrings and so on).
We get AAGAA as an answer. But it's too slow.
from collections import Counter
import sys
n=int(input()) #length of string
s1=input()
s=Counter(s1)
le=int(n/4) #ideal length of each element
comp={'A':le,'G':le,'C':le,'T':le} #dictionary containing equal number of all elements
s.subtract(comp) #Finding by how much each element ('A','G'...) is in excess or loss
a=[]
b=[]
for x in s.values(): #storing frequency(s.values--[4,2]) of elements which are in excess
if(x>0):
a.append(x)
for x in s.keys(): #storing corresponding elements(s.keys--['A','G'])
if(s[x]>0):
b.append(x)
mnum=sum(a) #minimum substring length to start with
if(mnum==0):
print(0)
sys.exit
flag=0
while(mnum<=n): #(when length 4 substring with all the A's and G's is not found increasing to 5 and so on)
for i in range(n-mnum+1): #Finding substrings with length mnum in s1
for j in range(len(a)): #Checking if all of excess elements are present
if(s1[i:i+mnum].count(b[j])==a[j]):
flag=1
else:
flag=0
if(flag==1):
print(mnum)
sys.exit()
mnum+=1
The minimal substring can be found in O(N) time and O(N) space.
First count a frequency fr[i] of each character from the input of length n.
Now, the most important thing to realise is that the necessary and sufficient condition for a substring to be considered minimal, it must contain each excessive character with a frequency of at least fr[i] - n/4. Otherwise, it won't be possible to replace the missing characters. So, our task is to go through each such substring and pick the one with the minimal length.
But how to find all of them efficiently?
At start, minLength is n. We introduce 2 pointer indices - left and right (initially 0) that define a substring from left to right in the original string str. Then, we increment right until the frequency of each excessive character in str[left:right] is at least fr[i] - n/4. But it's not all yet since str[left : right] may contain unnecessary chars to the left (for example, they're not excessive and so can be removed). So, we increment left as long as str[left : right] still contains enough excessive elements. When we're finished we update minLength if it's larger than right - left. We repeat the procedure until right >= n.
Let's consider an example. Let GAAAAAAA be the input string. Then, the algorithm steps are as below:
1.Count frequencies of each character:
['G'] = 1, ['A'] = 6, ['T'] = 0, ['C'] = 0 ('A' is excessive here)
2.Now iterate through the original string:
Step#1: |G|AAAAAAA
substr = 'G' - no excessive chars (left = 0, right = 0)
Step#2: |GA|AAAAAA
substr = 'GA' - 1 excessive char, we need 5 (left = 0, right = 1)
Step#3: |GAA|AAAAA
substr = 'GAA' - 2 excessive chars, we need 5 (left = 0, right = 2)
Step#4: |GAAA|AAAA
substr = 'GAAA' - 3 excessive chars, we need 5 (left = 0, right = 3)
Step#5: |GAAAA|AAA
substr = 'GAAAA' - 4 excessive chars, we need 5 (left = 0, right = 4)
Step#6: |GAAAAA|AA
substr = 'GAAAAA' - 5 excessive chars, nice but can we remove something from left? 'G' is not excessive anyways. (left = 0, right = 5)
Step#7: G|AAAAA|AA
substr = 'AAAAA' - 5 excessive chars, wow, it's smaller now. minLength = 5 (left = 1, right = 5)
Step#8: G|AAAAAA|A
substr = 'AAAAAA' - 6 excessive chars, nice, but can we reduce the substr? There's a redundant 'A'(left = 1, right = 6)
Step#9: GA|AAAAA|A
substr = 'AAAAA' - 5 excessive chars, nice, minLen = 5 (left = 2, right = 6)
Step#10: GA|AAAAAA|
substr = 'AAAAAA' - 6 excessive chars, nice, but can we reduce the substr? There's a redundant 'A'(left = 2, right = 7)
Step#11: GAA|AAAAA|
substr = 'AAAAA' - 5 excessive chars, nice, minLen = 5 (left = 3, right = 7)
Step#12: That's it as right >= 8
Or the full code below:
from collections import Counter
n = int(input())
gene = raw_input()
char_counts = Counter()
for i in range(n):
char_counts[gene[i]] += 1
n_by_4 = n / 4
min_length = n
left = 0
right = 0
substring_counts = Counter()
while right < n:
substring_counts[gene[right]] += 1
right += 1
has_enough_excessive_chars = True
for ch in "ACTG":
diff = char_counts[ch] - n_by_4
# the char cannot be used to replace other items
if (diff > 0) and (substring_counts[ch] < diff):
has_enough_excessive_chars = False
break
if has_enough_excessive_chars:
while left < right and substring_counts[gene[left]] > (char_counts[gene[left]] - n_by_4):
substring_counts[gene[left]] -= 1
left += 1
min_length = min(min_length, right - left)
print (min_length)
Here's one solution with limited testing done. This should give you some ideas on how to improve your code.
from collections import Counter
import sys
import math
n = int(input())
s1 = input()
s = Counter(s1)
if all(e <= n/4 for e in s.values()):
print(0)
sys.exit(0)
result = math.inf
out = 0
for mnum in range(n):
s[s1[mnum]] -= 1
while all(e <= n/4 for e in s.values()) and out <= mnum:
result = min(result, mnum - out + 1)
s[s1[out]] += 1
out += 1
print(result)

improve my code to group same words in a large list python and comparison to other code

I've been reading some of the other links ( What is a good strategy to group similar words? and Fuzzy Group By, Grouping Similar Words) that are related to group similar words. I'm curious (1) if someone can give me guidance as to how one of the algorithms I found in the second link works and (2) how the style of the programming compares to my own 'naive' approach?
If you can even just answer either 1 or 2, I'll give you an upvote.
(1) Can someone help step me through what's going on here?
class Seeder:
def __init__(self):
self.seeds = set()
self.cache = dict()
def get_seed(self, word):
LIMIT = 2
seed = self.cache.get(word,None)
if seed is not None:
return seed
for seed in self.seeds:
if self.distance(seed, word) <= LIMIT:
self.cache[word] = seed
return seed
self.seeds.add(word)
self.cache[word] = word
return word
def distance(self, s1, s2):
l1 = len(s1)
l2 = len(s2)
matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
for zz in xrange(0,l2):
for sz in xrange(0,l1):
if s1[sz] == s2[zz]:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
else:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
return matrix[l2][l1]
import itertools
def group_similar(words):
seeder = Seeder()
words = sorted(words, key=seeder.get_seed)
groups = itertools.groupby(words, key=seeder.get_seed)
(2)
In my approach I have a list of strings I want to group called residencyList and used default dictionaries.
Array(['Psychiatry', 'Radiology Medicine-Prelim',
'Radiology Medicine-Prelim', 'Medicine', 'Medicine',
'Obstetrics/Gynecology', 'Obstetrics/Gyncology',
'Orthopaedic Surgery', 'Surgery', 'Pediatrics',
'Medicine/Pediatrics',])
My effort to group. I base it off uniqueResList, which is np.unique(residencyList)
d = collections.defaultdict(int)
for i in residencyList:
for x in uniqueResList:
if x == i:
if not d[x]:
#print i, x
d[x] = i
#print d
if d[x]:
d[x] = d.get(x, ()) + ', ' + i
else:
#print 'no match'
continue
A short explanation of the "ninja maths" in distance:
# this is just the edit distance (Levenshtein) between the two words
def distance(self, s1, s2):
l1 = len(s1) # length of first word
l2 = len(s2) # length of second word
matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
# make an l2 + 1 by l1 + 1 matrix where the first row and column count up from
# 0 to l1 and l2 respectively (these will be the costs of
# deleting the letters that came before that element in each word)
for zz in xrange(0,l2):
for sz in xrange(0,l1):
if s1[sz] == s2[zz]: # if the two letters are the same then we
# don't have to change them so take the
# cheapest path from the options of
# matrix[zz+1][sz] + 1 (delete the letter in s1)
# matrix[zz][sz+1] + 1 (delete the letter in s2)
# matrix[zz][sz] (leave both letters)
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
else: # if the two letters are not the same then we
# have to change them so take the
# cheapest path from the options of
# matrix[zz+1][sz] + 1 (delete the letter in s1)
# matrix[zz][sz+1] + 1 (delete the letter in s2)
# matrix[zz][sz] + 1 (swap a letter)
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
return matrix[l2][l1] # the value at the bottom of the matrix is equal to the cheapest set of edits
I'll try to answer the first part. The class Seeder attempts to find seeds of the words. Two similar words are assumed to have the same seed and the similarity is controlled by the parameter LIMIT (In this case it is 2) which measures a distance between two words. There are many ways to compute String distance and your class does so in the distance function using some sort of ninja maths that is frankly above me.
def __init__(self):
self.seeds = set()
self.cache = dict()
Initialize the seeds as a set that keeps track of unique seeds till now and a cache that speeds up lookups in case we already have already seen the word (To save computation time).
For any word the get_seed function returns its seed.
def get_seed(self, word):
#Set the acceptable distance
LIMIT = 2
#Have we seen this word before?
seed = self.cache.get(word,None)
if seed is not None:
#YES. Return from the cache
return seed
for seed in self.seeds:
#NO. For each pre-existing seed, find the distance of this word from that seed
if self.distance(seed, word) <= LIMIT:
#This word is similar to the seed
self.cache[word] = seed
#We found this word's seed, cache it and return
return seed
#No we couldn't find a matching word in seeds. This is a new seed
self.seeds.add(word)
#Cache this word for future
self.cache[word] = word
#And return the seed (=word)
return word
Then you sort the list of words in question by their seed. This ensures that words that have the same seed occur next to each other. This is important for the group by you use to form groups of words based on seed.
The distance function looks complicated and could possibly be replaced by something like Levenshtein.

Categories

Resources