I'm working on a cs50/pset6/dna project. I'm struggling with finding a way to analyze a sequence of strings, and gather the maximum number of times a certain sequence of characters repeats consecutively. Here is an example:
String: JOKHCNHBVDBVDBVDJHGSBVDBVD
Sequence of characters I should look for: BVD
Result: My function should return 3, because at one point the characters BVD repeat three times consecutively. Even though they repeat again two times later on, I need the longest run.
It's a bit lame, but one brute-force way is to check for the longest possible repeated substring first and work downwards. As soon as a repeated substring is found, stop searching:
EDIT - Using a function might be more straightforward:
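def get_longest_repeating_pattern(string, pattern):
    # An empty pattern can never repeat, so there is nothing to find.
    if not pattern:
        return ""
    # Try the largest possible number of repetitions first and count down.
    for i in range(len(string)//len(pattern), 0, -1):
        current_pattern = pattern * i
        if current_pattern in string:
            return current_pattern
    return ""

string = "JOKHCNHBVDBVDBVDJHGSBVDBVD"
pattern = "BVD"
longest_repeating_pattern = get_longest_repeating_pattern(string, pattern)
# Divide by the pattern length to get the number of repetitions (3 here).
print(len(longest_repeating_pattern) // len(pattern))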
EDIT - explanation:
First, just a simple for-loop that starts at a larger number and goes down to a smaller number. For example, we start at 5 and go down to 0 (but not including 0), with a step size of -1:
>>> for i in range(5, 0, -1):
...     print(i)
...
5
4
3
2
1
>>>
If string = "JOKHCNHBVDBVDBVDJHGSBVDBVD", then len(string) is 26; if pattern = "BVD", then len(pattern) is 3.
Back to my original code:
for i in range(len(string)//len(pattern), 0, -1):
Plugging in the numbers:
for i in range(26//3, 0, -1):
26//3 is an integer division which yields 8, so this becomes:
for i in range(8, 0, -1):
So, it's a for-loop that goes from 8 down to 1 (remember, it doesn't go down to 0). i takes on a new value for each iteration: first 8, then 7, etc.
In Python, you can "multiply" strings, like so:
>>> pattern = "BVD"
>>> pattern * 1
'BVD'
>>> pattern * 2
'BVDBVD'
>>> pattern * 3
'BVDBVDBVD'
>>>
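Putting the countdown loop and string multiplication together, a minimal sketch that returns the repetition count directly (rather than the repeated text) could look like the following. The name max_repeats is made up for this illustration.

```python
def max_repeats(string, pattern):
    """Return the largest i such that pattern * i occurs somewhere in string."""
    if not pattern:
        return 0
    # Start with the most repetitions that could possibly fit and count down.
    for i in range(len(string) // len(pattern), 0, -1):
        if pattern * i in string:
            return i
    return 0


print(max_repeats("JOKHCNHBVDBVDBVDJHGSBVDBVD", "BVD"))  # 3
```

Returning i avoids having to divide the length of the match by the length of the pattern afterwards.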
A slightly less brute-force solution:
string = 'JOKHCNHBVDBVDBVDJHGSBVDBVD'
key = 'BVD'
len_k = len(key)
max_l = 0     # longest run of consecutive keys seen so far
passes = 0    # number of upcoming positions to skip
curr_len = 0  # length of the current run
for i in range(len(string) - len_k + 1):  # look at every substring of the same length as key
    if passes > 0:  # the key was just matched, so skip the positions inside it (for 'BVD', ignore 'VD.' and 'D..')
        passes -= 1
        continue
    s = string[i:i+len_k]
    if s == key:
        curr_len += 1
        if curr_len > max_l:
            max_l = curr_len
        passes = len_k - 1
    else:
        curr_len = 0
print(max_l)
You can do that very easily, elegantly and efficiently using a regex.
We look for all sequences of at least one repetition of your search string. Then, we just need to take the maximum length of these sequences, and divide by the length of the search string.
The regex we use is '(?:<your_sequence>)+': at least one repetition (the +) of the group (<your_sequence>). The ?: is just here to make the group non-capturing, so that findall returns the whole match, and not just the group.
In case there is no match, we use the default parameter of the max function to return 0.
The code is very short, then:
import re

def max_consecutive_repetitions(search, data):
    # re.escape keeps any regex metacharacters in search from being interpreted
    search_re = re.compile('(?:' + re.escape(search) + ')+')
    return max((len(seq) for seq in search_re.findall(data)), default=0) // len(search)
Sample run:
print(max_consecutive_repetitions("BVD", "JOKHCNHBVDBVDBVDJHGSBVDBVD"))
# 3
This is my contribution, I'm not a professional but it worked for me (sorry for bad English)
results = {}
# Loops through all the STRs
for i in range(1, len(reader.fieldnames)):
    STR = reader.fieldnames[i]
    j = 0
    s = 0
    pre_s = 0
    # Loops through all the characters in sequence.txt
    # (<= so that an STR sitting at the very end of the sequence is also checked)
    while j <= (len(sequence) - len(STR)):
        # checks if the character we are currently looking at is the same as the first STR character
        if STR[0] == sequence[j]:
            # while the sub-string from j to j + STR length is the same as STR, I call this a streak
            while sequence[j:(j + len(STR))] == STR:
                # j skips to the end of the sub-string
                j += len(STR)
                # streak counter
                s += 1
            # s > 0 means that the whole STR and the sequence coincided at least once
            if s > 0:
                # save the largest streak as pre_s
                if s > pre_s:
                    pre_s = s
                # restarts the streak counter to continue exploring the sequence
                s = 0
        j += 1
    # assigns pre_s value to a dictionary with the current STR as key
    results[STR] = pre_s
print(results)
Help! I'm a Python beginner given the assignment of displaying the Collatz Sequence from a user-inputted integer, and displaying the contents in columns and rows. As you may know, the results could be 10 numbers, 30, or 100. I'm supposed to use '\t'. I've tried many variations, but at best, only get a single column. e.g.
def sequence(number):
    if number % 2 == 0:
        return number // 2
    else:
        result = number * 3 + 1
        return result

n = int(input('Enter any positive integer to see Collatz Sequence:\n'))
while sequence != 1:
    n = sequence(int(n))
    print('%s\t' % n)
    if n == 1:
        print('\nThank you! The number 1 is the end of the Collatz Sequence')
        break
Which yields a single vertical column, rather than the results being displayed horizontally. Ideally, I'd like to display 10 results left to right, and then go to another line. Thanks for any ideas!
Something like this maybe:
def get_collatz(n):
    return [n // 2, n * 3 + 1][n % 2]

while True:
    user_input = input("Enter a positive integer: ")
    try:
        n = int(user_input)
        assert n > 1
    except (ValueError, AssertionError):
        continue
    else:
        break

sequence = [n]
while True:
    last_item = sequence[-1]
    if last_item == 1:
        break
    sequence.append(get_collatz(last_item))
print(*sequence, sep="\t")
Output:
Enter a positive integer: 12
12 6 3 10 5 16 8 4 2 1
>>>
EDIT Trying to keep it similar to your code:
I would change your sequence function to something like this:
def get_collatz(n):
    if n % 2 == 0:
        return n // 2
    return n * 3 + 1
I called it get_collatz because I think that is more descriptive than sequence, it's still not a great name though - if you wanted to be super explicit maybe get_collatz_at_n or something.
Notice, I took the else branch out entirely, since it's not required. If n % 2 == 0, then we return from the function, so either you return in the body of the if or you return one line below - no else necessary.
For the rest, maybe:
last_number = int(input("Enter a positive integer: "))
while last_number != 1:
    print(last_number, end="\t")
    last_number = get_collatz(last_number)
In Python, print has an optional keyword parameter named end, which by default is \n. It signifies which character should be printed at the very end of a print-statement. By simply changing it to \t, you can print all elements of the sequence on one line, separated by tabs (since each number in the sequence invokes a separate print-statement).
With this approach, however, you'll have to make sure to print the trailing 1 after the while loop has ended, since the loop will terminate as soon as last_number becomes 1, which means the loop won't have a chance to print it.
Another way of printing the sequence (with separating tabs), would be to store the sequence in a list, and then use str.join to create a string out of the list, where each element is separated by some string or character. Of course this requires that all elements in the list are strings to begin with - in this case I'm using map to convert the integers to strings:
result = "\t".join(map(str, [12, 6, 3, 10, 5, 16, 8, 4, 2, 1]))
print(result)
Output:
12 6 3 10 5 16 8 4 2 1
>>>
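The question also asks for 10 results per line. One way to get there, as a rough sketch, is to build the whole sequence first and then join it in chunks of ten. The helper names collatz_sequence and format_rows are invented for this example, and the input is assumed to be a positive integer.

```python
def collatz_sequence(n):
    """Return the Collatz sequence starting at n and ending at 1 (n must be >= 1)."""
    sequence = [n]
    while sequence[-1] != 1:
        last = sequence[-1]
        sequence.append(last // 2 if last % 2 == 0 else last * 3 + 1)
    return sequence


def format_rows(values, per_row=10):
    """Join values with tabs, starting a new line after every per_row items."""
    rows = []
    for start in range(0, len(values), per_row):
        chunk = values[start:start + per_row]
        rows.append("\t".join(str(v) for v in chunk))
    return "\n".join(rows)


print(format_rows(collatz_sequence(27)))
```

Because the list already contains the final 1, there is no need to print it separately after the loop.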
Similar to this and many other questions, I have many nested loops (up to 16) of the same structure.
Problem: I have 4-letter alphabet and want to get all possible words of length 16. I need to filter those words. These are DNA sequences (hence 4 letter: ATGC), filtering rules are quite simple:
no XXXX substrings (i.e. can't have same letter in a row more than 3 times, ATGCATGGGGCTA is "bad")
specific GC content, that is number of Gs + number of Cs should be in specific range (40-50%). ATATATATATATA and GCGCGCGCGCGC are bad words
itertools.product will work for that, but the resulting data structure is going to be giant (4^16 ≈ 4.3*10^9 words).
More importantly, if I do use product, I still have to go through each element to filter it out, so I end up with about 4 billion steps times 2.
My current solution is nested for loops:
alphabet = ['a','t','g','c']
for p1 in alphabet:
    for p2 in alphabet:
        for p3 in alphabet:
            ...skip...
                for p16 in alphabet:
                    word = p1+p2+p3+...+p16
                    if word_is_good(word):
                        good_words.append(word)
                        counter+=1
Is there a good pattern to program this without 16 nested loops? Is there a way to parallelize it efficiently (on multiple cores or multiple EC2 nodes)?
Also, with that pattern I can plug the word_is_good check into the middle of the loops, since a word that starts badly is bad:
...skip...
        for p3 in alphabet:
            word_3 = p1+p2+p3
            if not word_is_good(word_3):
                break
            for p4 in alphabet:
                ...skip...
from itertools import product, islice
from time import time

length = 16

def generate(start, alphabet):
    """
    A recursive generator function which works like itertools.product
    but restricts the alphabet as it goes based on the letters accumulated so far.
    """
    if len(start) == length:
        yield start
        return
    gcs = start.count('g') + start.count('c')
    if gcs >= length * 0.5:
        alphabet = 'at'
    # consider the maximum number of Gs and Cs we can have in the end
    # if we add one more A/T now
    elif length - len(start) - 1 + gcs < length * 0.4:
        alphabet = 'gc'
    for c in alphabet:
        if start.endswith(c * 3):
            continue
        for string in generate(start + c, alphabet):
            yield string

def brute_force():
    """ Straightforward method for comparison """
    lower = length * 0.4
    upper = length * 0.5
    for s in product('atgc', repeat=length):
        if lower <= s.count('g') + s.count('c') <= upper:
            s = ''.join(s)
            if not ('aaaa' in s or
                    'tttt' in s or
                    'cccc' in s or
                    'gggg' in s):
                yield s

def main():
    funcs = [
        lambda: generate('', 'atgc'),
        brute_force
    ]

    # Testing performance
    for func in funcs:
        # This needs to be big to get an accurate measure,
        # otherwise `brute_force` seems slower than it really is.
        # This is probably because of how `itertools.product`
        # is implemented.
        count = 100000000
        start = time()
        for _ in islice(func(), count):
            pass
        print(time() - start)

    # Testing correctness
    global length
    length = 12
    for x, y in zip(*[func() for func in funcs]):
        assert x == y, (x, y)

main()
On my machine, generate was just a bit faster than brute_force, at about 390 seconds vs 425. This was pretty much as fast as I could make them. I think the full thing would take about 2 hours. Of course, actually processing them will take much longer. The problem is that your constraints don't reduce the full set much.
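To get a feel for how little the constraints prune, one can count the surviving words for small lengths by brute force. This is only a rough sketch: count_good and its parameters are names invented here, the 40-50% GC bounds and the "no four in a row" rule are taken from the question, and the small lengths are chosen purely so that it runs quickly.

```python
from itertools import product


def count_good(length, low=0.4, high=0.5, alphabet="atgc"):
    """Count words of the given length that pass both filters from the question."""
    good = 0
    for letters in product(alphabet, repeat=length):
        word = "".join(letters)
        gc = word.count("g") + word.count("c")
        # GC content must lie within [low, high] as a fraction of the length.
        if not (low * length <= gc <= high * length):
            continue
        # No letter may appear four times in a row.
        if any(ch * 4 in word for ch in alphabet):
            continue
        good += 1
    return good


for n in (4, 6, 8):
    total = 4 ** n
    print(n, count_good(n), total, count_good(n) / total)
```

Whatever fraction survives at small lengths, the same exhaustive walk at length 16 still has to touch every candidate, which is why pruning early (as generate does) matters more than the final filter.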
Here's an example of how to use this in parallel across 16 processes:
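from multiprocessing.pool import Pool

alpha = 'atgc'

def generate_worker(start):
    start = ''.join(start)
    for s in generate(start, alpha):
        print(s)

if __name__ == '__main__':
    # The guard is needed on platforms that spawn worker processes (e.g. Windows).
    Pool(16).map(generate_worker, product(alpha, repeat=2))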
Since you happen to have an alphabet of length 4 (or any power of 2), the idea of using an integer ID and bit-wise operations comes to mind, instead of checking for consecutive characters in strings. We can assign an integer value to each of the characters in the alphabet; for simplicity, let's use the index corresponding to each letter.
Example:
65463543 (base 10) = 3321232103313 (base 4) = 'aaaddcbcdcbaddbd'
The following function converts from a base 10 integer to a word using alphabet.
def id_to_word(word_id, word_len):
    word = ''
    while word_id:
        rem = word_id & 0x3  # 2 bits per letter
        word = ALPHABET[rem] + word
        word_id >>= 2  # Bit shift to the next letter
    # Left-pad with the first letter of the alphabet up to word_len
    return '{2:{0}>{1}}'.format(ALPHABET[0], word_len, word)
Now for a function to check whether a word is "good" based on its integer ID. The following method is of a similar format to id_to_word, except a counter is used to keep track of consecutive characters. The function will return False if the maximum number of identical consecutive characters is exceeded, otherwise it returns True.
def check_word(word_id, max_consecutive):
    consecutive = 0
    previous = None
    while word_id:
        rem = word_id & 0x3
        if rem != previous:
            consecutive = 0
        consecutive += 1
        if consecutive == max_consecutive + 1:
            return False
        word_id >>= 2
        previous = rem
    return True
We're effectively thinking of each word as an integer in base 4. If the alphabet length were not a power of 2, then modulo % alpha_len and integer division // alpha_len could be used in place of & (alpha_len - 1) and >> log2(alpha_len) respectively, although it would take much longer.
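As a sketch of the modulo and integer-division variant mentioned above, the conversion from ID to word for an alphabet of any length might look like this. The function name id_to_word_any_base and its alphabet parameter are made up for illustration; divmod performs the % and // steps in one call.

```python
def id_to_word_any_base(word_id, word_len, alphabet):
    """Convert a base-10 id to a word, treating alphabet as the digit set."""
    base = len(alphabet)
    letters = []
    while word_id:
        # divmod returns (word_id // base, word_id % base) in one step.
        word_id, rem = divmod(word_id, base)
        letters.append(alphabet[rem])
    # Digits were produced least significant first, so reverse them,
    # then left-pad with the "zero" letter up to the requested length.
    word = "".join(reversed(letters))
    return word.rjust(word_len, alphabet[0])


print(id_to_word_any_base(65463543, 16, "abcd"))  # aaaddcbcdcbaddbd
print(id_to_word_any_base(5, 3, "xyz"))           # xyz
```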
Finally, finding all the good words for a given word_len. The advantage of using a range of integer values is that you can reduce the number of for-loops in your code from word_len to 2, albeit the outer loop is very large. This may allow for more friendly multiprocessing of your good word finding task. I have also added in a quick calculation to determine the smallest and largest IDs corresponding to good words, which helps significantly narrow down the search for good words
ALPHABET = ('a', 'b', 'c', 'd')

def find_good_words(word_len):
    max_consecutive = 3
    alpha_len = len(ALPHABET)
    # Determine the words corresponding to the smallest and largest ids
    smallest_word = ''  # aaabaaabaaabaaab
    largest_word = ''   # dddcdddcdddcdddc
    for i in range(word_len):
        # Append (rather than prepend) so that the words match the comments above
        if (i + 1) % (max_consecutive + 1):
            smallest_word += ALPHABET[0]
            largest_word += ALPHABET[-1]
        else:
            smallest_word += ALPHABET[1]
            largest_word += ALPHABET[-2]
    # Determine the integer ids of said words
    trans_table = str.maketrans({c: str(i) for i, c in enumerate(ALPHABET)})
    smallest_id = int(smallest_word.translate(trans_table), alpha_len)  # 16843009 for word_len = 16
    largest_id = int(largest_word.translate(trans_table), alpha_len)    # 4278124286 for word_len = 16
    # Find and store the id's of "good" words
    counter = 0
    goodies = []
    for i in range(smallest_id, largest_id + 1):
        if check_word(i, max_consecutive):
            goodies.append(i)
            counter += 1
    return goodies
In this loop I have specifically stored the word's ID as opposed to the actual word itself, in case you are going to use the words for further processing. However, if you are just after the words, then change the goodies.append(i) line to read goodies.append(id_to_word(i, word_len)).
NOTE: I receive a MemoryError when attempting to store all good IDs for word_len >= 14. I suggest writing these IDs/words to a file of some sort!
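A minimal sketch of the "write to a file" suggestion: instead of collecting IDs in a list, each good ID is written out as soon as it is found, so memory use stays flat. The names is_good and write_good_ids are invented for this example, and unlike check_word above, is_good inspects exactly word_len letters (including leading 'a's) so that the whole range of IDs can be scanned.

```python
import os
import tempfile


def is_good(word_id, word_len, max_consecutive=3):
    """True if no 2-bit letter repeats more than max_consecutive times in a row."""
    run, previous = 0, None
    for _ in range(word_len):
        letter = word_id & 0x3
        run = run + 1 if letter == previous else 1
        if run > max_consecutive:
            return False
        previous = letter
        word_id >>= 2
    return True


def write_good_ids(path, word_len, max_consecutive=3):
    """Write one good id per line to path and return how many were written."""
    count = 0
    with open(path, "w") as out:
        for word_id in range(4 ** word_len):
            if is_good(word_id, word_len, max_consecutive):
                out.write(f"{word_id}\n")
                count += 1
    return count


# Usage with a small word length so that it finishes quickly.
path = os.path.join(tempfile.mkdtemp(), "good_ids.txt")
print(write_good_ids(path, 5))
```

For word_len = 16 the file will still be large, but it can be processed line by line afterwards.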
I'm trying to count the number of differences between two imported strings (seq1 and seq2, import code not listed), but am getting no result when running the program. I want the output to read something like "2 differences." Not sure where I'm going wrong...
def difference (seq1, seq2):
    count = 0
    for i in seq1:
        if seq1[i] != seq2[i]:
            count += 1
        return (count)
    print (count, "differences")
You could do this pretty flatly with a generator expression
count = sum(1 for a, b in zip(seq1, seq2) if a != b)
If the sequences are of a different length, then you may consider the difference in length to be difference in content (I would). In that case, tag on an extra piece to account for it
count = sum(1 for a, b in zip(seq1, seq2) if a != b) + abs(len(seq1) - len(seq2))
Another weirdish way to write that which takes advantage of True being 1 and False being 0 is:
sum(a != b for a, b in zip(seq1, seq2))+ abs(len(seq1) - len(seq2))
zip is a python builtin that allows you to iterate over two sequences at once. It will also terminate on the shortest sequence, observe:
>>> seq1 = 'hi'
>>> seq2 = 'world'
>>> for a, b in zip(seq1, seq2):
... print('a =', a, '| b =', b)
...
a = h | b = w
a = i | b = o
This will evaluate similar to sum([1, 1, 1]) where each 1 represents a difference between the two sequences. The if a != b filter causes the generator to only produce a value when a and b differ.
When you say for i in seq1 you are iterating over the characters, not the indexes. You can use enumerate by saying for i, ch in enumerate(seq1) instead.
Or even better, use the standard function zip to go through both sequences at once.
You also have a problem because you return before you print. Probably your return needs to be moved down and unindented.
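A small sketch of the enumerate version described above, with the return moved after the loop. It assumes seq2 is at least as long as seq1; the sample strings are made up.

```python
def difference(seq1, seq2):
    """Count the positions at which seq1 and seq2 hold different characters."""
    count = 0
    # enumerate yields (index, character) pairs, so i is an integer here.
    for i, ch in enumerate(seq1):
        if ch != seq2[i]:
            count += 1
    # Return only after the whole loop has finished.
    return count


print(difference("GATTACA", "GACTATA"), "differences")  # 2 differences
```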
There are two mistakes in your script:
"i" should be an integer, not a character
"return" should be at the function level (the same level as print), not inside the "for" loop
Also, try not to use "print" inside functions in this way.
Here is a working version:
def difference (seq1, seq2):
    count = 0
    for i in range(len(seq1)):
        if seq1[i] != seq2[i]:
            count += 1
    return (count)
So I had to do what you are asking to do and I came up with a very simple solution. Mine is a little different because I check the string to see which is bigger and put them in the correct variable for comparison later. All done with Vanilla python:
#Declare Variables
a='Here is my first string'
b='Here is my second string'
notTheSame=0
count=0
#Check which string is bigger and put the bigger string in C and smaller string in D
if len(a) >= len(b):
    c=a
    d=b
if len(b) > len(a):
    d=a
    c=b
#While the counter is less than the length of the longest string, compare each letter.
while count < len(c):
    if count == len(d):
        break
    if c[count] != d[count]:
        print(c[count] + " not equal to " + d[count])
        notTheSame = notTheSame + 1
    else:
        print(c[count] + " is equal to " + d[count])
    count=count+1
#the below output is a count of all the differences + the difference between the 2 strings
print("Number of Differences: " + str(len(c)-len(d)+notTheSame))
Correct code would be:
def difference(seq1, seq2):
    count = 0
    for i in range(len(seq1)):
        if seq1[i] != seq2[i]:
            count += 1
    return count
First, the return statement belongs at the end of the function; it should not be part of the for loop, or the loop would run only once.
Second, the for loop wasn't correct because you weren't giving it integers to index with; the correct code gives it a range the length of seq1, so:
for i in range(len(seq1)):
So I need to create 3 functions for this assignment.
The first should do the following
numCol(): When dealing with plain text (and data written in plain text), it can be helpful to have a “scale” that indicates the columns in which characters appear. We create a scale using two lines (or “rows”). In the second row, we print 1234567890 repeatedly. In the first row (i.e., the line above the second row), we write the “tens” digits above the zeros in the second row as shown in the listing below. This function takes one argument, an integer, and prints a scale the length of your quote. It doesn't return anything.
The second should
docQuote(): Takes three arguments: 1) the quote as a string, 2) the slice start value, and 3) the slice end value. It returns the doctored string.
The third should
main(): Takes no arguments and returns nothing. Prompts user for original quote and number of slices needed. Then in a for-loop, calls numCol() in such a way that the scale is the length of the quote, prompts the user for the start and end slice values (recall that the end value isn’t included in the slice), and then calls docQuote(). Finally, it prints the final doctored quote.
If the program is correct its output should look as follows:
Enter quote: Money is the root of all evil.
Enter the number of slices needed: 2
          1         2
012345678901234567890123456789
Money is the root of all evil.
Start and end for slicing separated by a comma: 8, 20
          1
012345678901234567
Money is all evil.
Start and end for slicing separated by a comma: 12, 17
-> Money is all.
What I have so far: (Will update if I figure something out)
def numCol(x):
    col=[]
    for i in range(1,(round(n)//10)+1):
        col.append(str(i))
    print(" "," ".join(col),end="")

def docQuote(x,y,z):
    return

def main():
    x=input("Enter quote: ")
    y=int(input("Enter the number of slices needed: "))
    numCol(len(x)-1)
    print(x)

main()
Ok: you have to define a function called numCol which takes one integer argument:
def numCol(n):
then you have to print a line consisting of n characters, where every tenth character is an incrementing integer and every other character is a space.
    chars = []
    for i in range(1, n+1):
        if i % 10:
            chars.append(" ")
        else:
            chars.append(str((i % 100) // 10))
    print(''.join(chars))
and finally a row consisting of 'n' chars, being 1234567890 repeating:
    chars = []
    for i in range(1, n+1):
        chars.append(str(i % 10))
    print(''.join(chars))
which then runs as
>>> numCol(65)
         1         2         3         4         5         6
12345678901234567890123456789012345678901234567890123456789012345
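Note that the expected output in the question starts its scale at column 0 rather than 1, so that the scale lines up with slice indices. A sketch of that variant follows; scale_lines and num_col_zero_based are names invented here, not part of the assignment.

```python
def scale_lines(n):
    """Return the (tens, ones) rows of a scale covering columns 0 .. n-1."""
    # The tens row is blank except above each multiple of ten (but not above 0).
    tens = "".join(
        str((i // 10) % 10) if i % 10 == 0 and i > 0 else " "
        for i in range(n)
    )
    # The ones row simply repeats 0123456789.
    ones = "".join(str(i % 10) for i in range(n))
    return tens, ones


def num_col_zero_based(n):
    """Print the two scale rows; returns nothing, like numCol in the assignment."""
    tens, ones = scale_lines(n)
    print(tens)
    print(ones)


num_col_zero_based(len("Money is the root of all evil."))
```

Returning the two rows from a helper keeps the printing function trivial and makes the rows easy to check.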
Edit:
In response to @AdamSmith:
Let's see some actual numbers:
from textwrap import dedent
from timeit import Timer

test_statements = [
    (
        "n = 65",
        """
        # as a for-loop
        chars = []
        for i in xrange(1, n+1):
            if i % 10:
                chars.append(" ")
            else:
                chars.append(str((i % 100) // 10))
        """
    ),
    (
        "n = 65",
        """
        # as a list comprehension
        chars = [" " if i%10 else str((i%100)//10) for i in xrange(1,n+1)]
        """
    ),
    (
        "n = 65",
        """
        # extra cost of list-to-string
        chars = [" " if i%10 else str((i%100)//10) for i in xrange(1,n+1)]
        s = ''.join(chars)
        """
    ),
    (
        "n = 65",
        """
        # vs cost of generator-to-string
        chars = (" " if i%10 else str((i%100)//10) for i in xrange(1,n+1))
        s = ''.join(chars)
        """
    ),
    (
        "s = '         1         2         3         4         5         6     '",
        """
        # cost of actually displaying string
        print(s)
        """
    )
]

for setup,run in test_statements:
    res = Timer(dedent(run), setup)
    times = res.repeat()   # 3 * 1000000 runs
    print("{:7.1f}".format(min(times)))   # time of one loop in microseconds
on my system (i5-760, Win7 x64, Python 2.7.5 64bit) this gives
15.1 # for-loop -> list of chars
10.7 # list comprehension -> list of chars
11.4 # list comprehension -> string
13.6 # generator expression -> string
132.1 # print the string
Conclusions:
a list comprehension is 29% faster than a for-loop at building a list of characters
a generator expression is 19.6% slower than a list comprehension at building a list of characters and joining to a string
it's pretty much irrelevant, because actually printing the output takes 9 times longer than generating it with any of these methods - by the time you print the string out, using a list comprehension (fastest) is just 2.9% faster than the for-loop (slowest).
@user3482104
If you really want to avoid if ... else, you can do
if stmt:
    do_a()
if not stmt:    # same effect as 'else:'
    do_b()
but be aware that this has to evaluate stmt twice where else only evaluates it once.
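The double evaluation can be made visible with a condition that counts how often it is called. This is just an illustrative sketch; make_counted, two_ifs and if_else are hypothetical helper names.

```python
def make_counted(value):
    """Return a condition function plus a dict recording how often it was called."""
    counter = {"calls": 0}

    def stmt():
        counter["calls"] += 1
        return value

    return stmt, counter


def two_ifs(stmt):
    result = None
    if stmt():
        result = "a"
    if not stmt():  # evaluates the condition a second time
        result = "b"
    return result


def if_else(stmt):
    if stmt():
        return "a"
    else:           # no second evaluation needed
        return "b"


stmt, counter = make_counted(True)
two_ifs(stmt)
print(counter["calls"])  # 2

stmt, counter = make_counted(True)
if_else(stmt)
print(counter["calls"])  # 1
```

Besides the extra cost, the two-if form can misbehave if the condition changes between the two checks.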
Also, because both loops are iterating over the same range (same start/end values), you can combine the loops:
def numCol(n):
    firstline = []
    secondline = []
    for i in range(1, n+1):
        i %= 100
        tens, ones = i // 10, i % 10
        if ones:        # ones != 0
            firstline.append(" ")
        if not ones:    # ones == 0    # <= 'else:'
            firstline.append(str(tens))
        secondline.append(str(ones))
    print(''.join(firstline))
    print(''.join(secondline))
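The question also asks for docQuote(), which none of the code above covers. A minimal sketch, assuming that "doctoring" means cutting the slice from start up to (but not including) end out of the quote, as the sample output suggests:

```python
def docQuote(quote, start, end):
    """Return quote with the characters quote[start:end] removed."""
    # Keep everything before start and everything from end onwards.
    return quote[:start] + quote[end:]


quote = "Money is the root of all evil."
quote = docQuote(quote, 8, 20)
print(quote)   # Money is all evil.
quote = docQuote(quote, 12, 17)
print(quote)   # Money is all.
```

The parameter names quote, start and end are chosen here for readability; the assignment only fixes their order.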