How to improve python dict performance?

I recently coded a Python solution using dictionaries which got a TLE verdict. The solution is essentially the same as a multiset solution in C++ which works. So we are sure that the logic is correct, but the implementation is not up to the mark.
The problem description for understanding below code (http://codeforces.com/contest/714/problem/C):
For each number we need to get a string of 0s and 1s such that the i-th digit is 0/1 if the respective i-th digit in the number is even/odd.
We need to maintain the count of numbers that have the same mapping, as described in the point above.
Any hints/pointers to improve the performance of the code below? It gave TLE (Time Limit Exceeded) for a large test case (http://codeforces.com/contest/714/submission/20594344).
from collections import defaultdict
def getPattern(s):
    return ''.join(list(s.zfill(19)))
def getSPattern(s):
    news = s.zfill(19)
    patlist = [ '0' if (int(news[i])%2 == 0) else '1' for i in range(19) ]
    return "".join(patlist)
t = int(raw_input())
pat = defaultdict(str) # holds strings as keys and int as value
for i in range(0, t):
    oper, num = raw_input().strip().split(' ')
    if oper == '+' :
        pattern = getSPattern(str(num))
        if pattern in pat:
            pat[pattern] += 1
        else:
            pat[pattern] = 1
    elif oper == '-' :
        pattern = getSPattern(str(num))
        pat[pattern] = max( pat[pattern] - 1, 0)
    elif oper == '?' :
        print pat.get(getPattern(num) , 0 )

I see lots of small problems with your code but can't say if they add up to significant performance issues:
You've set up, and used, your defaultdict() incorrectly:
pat = defaultdict(str)
...
if pattern in pat:
    pat[pattern] += 1
else:
    pat[pattern] = 1
The argument to the defaultdict() constructor should be the type of the values, not the keys. Once you've set up your defaultdict properly, you can simply do:
pat = defaultdict(int)
...
pat[pattern] += 1
As the value will now default to zero if the pattern isn't there already.
Since the specification says:
 -  ai — delete a single occurrence of non-negative integer ai from the multiset. It's guaranteed, that there is at least one ai in the
multiset.
Then this:
pat[pattern] = max( pat[pattern] - 1, 0)
can simply be this:
pat[pattern] -= 1
You're working with 19 character strings but since the specification says the numbers will be less than 10 ** 18, you can work with 18 character strings instead.
getSPattern() does a zfill() and then processes the string, it should do it in the reverse order, process the string and then zfill() it, as there's no need to run the logic on the leading zeros.
We don't need the overhead of int() to convert the characters to numbers:
(int(news[i])%2 == 0)
Consider using ord() instead as the ASCII values of the digits have the same parity as the digits themselves: ord('4') -> 52
And you don't need to loop over the indexes, you can simply loop over the characters.
Below is my rework of your code with the above changes, see if it still works (!) and gains you any performance:
from collections import defaultdict
def getPattern(string):
    return string.zfill(18)
def getSPattern(string):
    # pattern_list = (('0', '1')[ord(character) % 2] for character in string)
    pattern_list = ('0' if ord(character) % 2 == 0 else '1' for character in string)
    return ("".join(pattern_list)).zfill(18)
patterns = defaultdict(int) # holds keys as strings and values as int
text = int(raw_input())
for _ in range(text):
    operation, number = raw_input().strip().split()
    if operation == '+':
        pattern = getSPattern(number)
        patterns[pattern] += 1
    elif operation == '-':
        pattern = getSPattern(number)
        patterns[pattern] -= 1
    elif operation == '?':
        print patterns.get(getPattern(number), 0)
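For what it's worth, here is a minimal Python 3 sketch of the same points (defaultdict(int), 18-character patterns, no int() per digit). It swaps the per-character generator for str.translate with a digit-to-parity table, which is my own substitution rather than something from the code above. The names PARITY, parity_pattern and run_queries are hypothetical, and I have not timed this against the judge.

```python
from collections import defaultdict

# Maps each decimal digit character to its parity bit ('0' even, '1' odd).
PARITY = str.maketrans('0123456789', '0101010101')

def parity_pattern(number):
    """Translate the digits to parity bits, then pad to 18 characters."""
    return number.translate(PARITY).zfill(18)

def run_queries(lines):
    """Process '+ x', '- x' and '? pattern' lines; return the '?' answers."""
    counts = defaultdict(int)
    answers = []
    for line in lines:
        operation, value = line.split()
        if operation == '+':
            counts[parity_pattern(value)] += 1
        elif operation == '-':
            # Assumes the number is present, as the problem guarantees.
            counts[parity_pattern(value)] -= 1
        else:  # '?': the value is already a 0/1 pattern, so only pad it
            answers.append(counts.get(value.zfill(18), 0))
    return answers
```

Collecting the answers in a list and printing them once at the end also avoids one print call per query, which may matter on large inputs.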

With the explanation already done by @cdlane, I just need to add my rewrite of getSPattern, where I think the bulk of the time is spent. As per my initial comment, this is available at https://eval.in/641639
def getSPattern(s):
    patlist = ['0' if c in ['0', '2', '4', '6', '8'] else '1' for c in s]
    return "".join(patlist).zfill(19)
Using zfill(18) might marginally spare you some time.

Related

Python: How to find all ways to decode a string?

I'm trying to solve this problem but it fails with input "226".
Problem:
A message containing letters from A-Z is being encoded to numbers using the following mapping:
'A' -> 1
'B' -> 2
...
'Z' -> 26
Given a non-empty string containing only digits, determine the total number of ways to decode it.
My Code:
class Solution:
def numDecodings(self, s: str) -> int:
decode =[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]
ways = []
for d in decode:
for i in s:
if str(d) == s or str(d) in s:
ways.append(d)
if int(i) in decode:
ways.append(str(i))
return len(ways)
My code returns 2. It only takes care of combinations (22,6) and (2,26).
It should be returning 3, so I'm not sure how to take care of the (2,2,6) combination.

This problem can be broken down into overlapping subproblems, so it can be solved recursively.
Subproblem 1: when the last digit of the string is valid (i.e. non-zero), recur on the remaining (n-1) digits:
if s[n-1] > "0":
    count = number_of_decodings(s,n-1)
Subproblem 2: when the last two digits form a valid number (between 10 and 26), recur on the remaining (n-2) digits:
if (s[n - 2] == '1' or (s[n - 2] == '2' and s[n - 1] < '7') ) :
    count += number_of_decodings(s, n - 2)
Base case: the length of the string is 0 or 1:
if n == 0 or n == 1 :
    return 1
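The three fragments above can be assembled into one function roughly as follows. This is only a sketch: the name num_decodings, the use of functools.lru_cache for memoization, and the extra check that a lone '0' has no decoding are my assumptions, not part of the fragments.

```python
from functools import lru_cache

def num_decodings(s):
    """Count the decodings of the digit string s (1 -> 'A', ..., 26 -> 'Z')."""

    @lru_cache(maxsize=None)
    def count(n):
        # Number of ways to decode the first n digits of s.
        if n == 0:
            return 1
        if n == 1:
            # Assumption: a single '0' cannot be decoded.
            return 1 if s[0] != '0' else 0
        total = 0
        # Subproblem 1: the last digit stands alone.
        if s[n - 1] > '0':
            total += count(n - 1)
        # Subproblem 2: the last two digits form a number from 10 to 26.
        if s[n - 2] == '1' or (s[n - 2] == '2' and s[n - 1] < '7'):
            total += count(n - 2)
        return total

    return count(len(s))
```

With memoization each prefix length is computed once, so the work is linear in the length of the string. Very long inputs could still hit Python's recursion limit.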
EDIT: After a quick search on the internet, I found another (more interesting) method, which uses dynamic programming to solve this particular problem:
# A Dynamic Programming based function
# to count decodings
def countDecodingDP(digits, n):
    count = [0] * (n + 1)  # A table to store results of subproblems
    count[0] = 1
    count[1] = 1
    for i in range(2, n + 1):
        count[i] = 0
        # If the last digit is not 0, then last
        # digit must add to the number of words
        if (digits[i - 1] > '0'):
            count[i] = count[i - 1]
        # If the second last digit is 1, or it is 2
        # and the last digit is smaller than 7, then
        # the last two digits form a valid character
        if (digits[i - 2] == '1' or
                (digits[i - 2] == '2' and
                 digits[i - 1] < '7')):
            count[i] += count[i - 2]
    return count[n]
The above solution runs in O(n) time and uses a method similar to the one used for computing Fibonacci numbers.
source: https://www.geeksforgeeks.org/count-possible-decodings-given-digit-sequence/

This seemed like a natural for recursion. Since I was bored, and the first answer didn't use recursion and didn't return the actual decodings, I thought there was room for improvement. For what it's worth...
def encodings(str, prefix = ''):
    encs = []
    if len(str) > 0:
        es = encodings(str[1:], (prefix + ',' if prefix else '') + str[0])
        encs.extend(es)
    if len(str) > 1 and int(str[0:2]) <= 26:
        es = encodings(str[2:], (prefix + ',' if prefix else '') + str[0:2])
        encs.extend(es)
    return encs if len(str) else [prefix]
This returns a list of the possible decodings. To get the count, you just take the length of the list. Here a sample run:
encs = encodings("123")
print("{} {}".format(len(encs), encs))
with result:
3 ['1,2,3', '1,23', '12,3']
Another sample run:
encs = encodings("123123")
print("{} {}".format(len(encs), encs))
with result:
9 ['1,2,3,1,2,3', '1,2,3,1,23', '1,2,3,12,3', '1,23,1,2,3', '1,23,1,23', '1,23,12,3', '12,3,1,2,3', '12,3,1,23', '12,3,12,3']

How to find the max number of times a sequence of characters repeats consecutively in a string? [duplicate]

This question already has answers here:
How to count consecutive repetitions of a substring in a string?
(4 answers)
Closed 1 year ago.
I'm working on a cs50/pset6/dna project. I'm struggling with finding a way to analyze a sequence of strings, and gather the maximum number of times a certain sequence of characters repeats consecutively. Here is an example:
String: JOKHCNHBVDBVDBVDJHGSBVDBVD
Sequence of characters I should look for: BVD
Result: My function should be able to return 3, because in one point the characters BVD repeat three times consecutively, and even though it repeats again two times, I should look for the time that it repeats the most number of times.

It's a bit lame, but one "brute-force"ish way would be to just check for the presence of the longest substring possible. As soon as a substring is found, break out of the loop:
EDIT - Using a function might be more straight forward:
def get_longest_repeating_pattern(string, pattern):
    if not pattern:
        return ""
    for i in range(len(string)//len(pattern), 0, -1):
        current_pattern = pattern * i
        if current_pattern in string:
            return current_pattern
    return ""
string = "JOKHCNHBVDBVDBVDJHGSBVDBVD"
pattern = "BVD"
longest_repeating_pattern = get_longest_repeating_pattern(string, pattern)
# The function returns the repeated text itself, so divide by the pattern length to get the repeat count
print(len(longest_repeating_pattern) // len(pattern))
EDIT - explanation:
First, just a simple for-loop that starts at a larger number and goes down to a smaller number. For example, we start at 5 and go down to 0 (but not including 0), with a step size of -1:
>>> for i in range(5, 0, -1):
...     print(i)
...
5
4
3
2
1
>>>
if string = "JOKHCNHBVDBVDBVDJHGSBVDBVD", then len(string) would be 26, if pattern = "BVD", then len(pattern) is 3.
Back to my original code:
for i in range(len(string)//len(pattern), 0, -1):
Plugging in the numbers:
for i in range(26//3, 0, -1):
26//3 is an integer division which yields 8, so this becomes:
for i in range(8, 0, -1):
So, it's a for-loop that goes from 8 to 1 (remember, it doesn't go down to 0). i takes on the new value for each iteration, first 8 , then 7, etc.
In Python, you can "multiply" strings, like so:
>>> pattern = "BVD"
>>> pattern * 1
'BVD'
>>> pattern * 2
'BVDBVD'
>>> pattern * 3
'BVDBVDBVD'
>>>

A slightly less bruteforcey solution:
string = 'JOKHCNHBVDBVDBVDJHGSBVDBVD'
key = 'BVD'
len_k = len(key)
max_l = 0
passes = 0
curr_len=0
prev_s = None # previous window examined; nothing has been seen yet
for i in range(len(string) - len_k + 1): # slide a window of the same length as key over the string
    if passes > 0: # key was just matched, so skip the windows overlapping it (if key is 'BVD', we ignore 'VD.' and 'D..')
        passes-=1
        continue
    s = string[i:i+len_k]
    if s == key:
        curr_len+=1
        if curr_len > max_l:
            max_l=curr_len
        passes = len(key)-1
        if prev_s == key:
            if curr_len > max_l:
                max_l=curr_len
    else:
        curr_len=0
    prev_s = s
print(max_l)

You can do that very easily, elegantly and efficiently using a regex.
We look for all sequences of at least one repetition of your search string. Then, we just need to take the maximum length of these sequences, and divide by the length of the search string.
The regex we use is '(?:<your_sequence>)+': at least one repetition (the +) of the group (<your_sequence>). The ?: is just there to make the group non-capturing, so that findall returns the whole match, and not just the group.
In case there is no match, we use the default parameter of the max function to return 0.
The code is very short, then:
import re
def max_consecutive_repetitions(search, data):
    search_re = re.compile('(?:' + search + ')+')
    return max((len(seq) for seq in search_re.findall(data)), default=0) // len(search)
Sample run:
print(max_consecutive_repetitions("BVD", "JOKHCNHBVDBVDBVDJHGSBVDBVD"))
# 3
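One caveat with the regex approach: the search string is pasted into the pattern as-is, so any regex metacharacter in it (a dot, for example) would be interpreted rather than matched literally. That cannot happen with DNA letters, but a hedged variant that passes the search string through re.escape first could look like this. The function name and the empty-string guard are my own additions.

```python
import re

def max_consecutive_repetitions_safe(search, data):
    """Longest run of back-to-back copies of search in data, as a repeat count."""
    if not search:
        # Assumption: an empty search string is treated as "no repetitions".
        return 0
    # re.escape makes every character of search match literally.
    pattern = re.compile('(?:' + re.escape(search) + ')+')
    longest = max((len(match) for match in pattern.findall(data)), default=0)
    return longest // len(search)
```

For plain alphabetic search strings this behaves the same as the version above.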

This is my contribution, I'm not a professional but it worked for me (sorry for bad English)
results = {}
# Loops through all the STRs
for i in range(1, len(reader.fieldnames)):
STR = reader.fieldnames[i]
j = 0
s=0
pre_s = 0
# Loops through all the characters in sequence.txt
while j < (len(sequence) - len(STR)):
# checks if the character we are currently looping is the same than the first STR character
if STR[0] == sequence[j]:
# while the sub-string since j to j - STR lenght is the same than STR, I called this a streak
while sequence[j:(j + len(STR))] == STR:
# j skips to the end of sub-string
j += len(STR)
# streaks counter
s += 1
# if s > 0 means that that the whole STR and sequence coincided at least once
if s > 0:
# save the largest streak as pre_s
if s > pre_s:
pre_s = s
# restarts the streak counter to continue exploring the sequence
s=0
j += 1
# assigns pre_s value to a dictionary with the current STR as key
results[STR] = pre_s
print(results)

Compressing multiple nested `for` loops

Similar to this and many other questions, I have many nested loops (up to 16) of the same structure.
Problem: I have a 4-letter alphabet and want to get all possible words of length 16, then filter those words. These are DNA sequences (hence 4 letters: ATGC), and the filtering rules are quite simple:
no XXXX substrings (i.e. the same letter can't appear more than 3 times in a row; ATGCATGGGGCTA is "bad")
specific GC content, that is, the number of Gs plus the number of Cs should be in a specific range (40-50%). ATATATATATATA and GCGCGCGCGCGC are bad words
itertools.product will work for that, but the data structure here is going to be giant (4^16 ≈ 4.3*10^9 words).
More importantly, if I do use product, I still have to go through each element to filter it. Thus I will have 4 billion steps times 2.
My current solution is nested for loops
alphabet = ['a','t','g','c']
for p1 in alphabet:
for p2 in alphabet:
for p3 in alphabet:
...skip...
for p16 in alphabet:
word = p1+p2+p3+...+p16
if word_is_good(word):
good_words.append(word)
counter+=1
Is there a good pattern to program this without 16 nested loops? Is there a way to parallelize it efficiently (on multiple cores or multiple EC2 nodes)?
Also, with that pattern I can plug the word_is_good check into the middle of the loops, since a word that starts badly is bad:
...skip...
for p3 in alphabet:
word_3 = p1+p2+p3
if not word_is_good(word_3):
break
for p4 in alphabet:
...skip...

from itertools import product, islice
from time import time
length = 16
def generate(start, alphabet):
    """
    A recursive generator function which works like itertools.product
    but restricts the alphabet as it goes based on the letters accumulated so far.
    """
    if len(start) == length:
        yield start
        return
    gcs = start.count('g') + start.count('c')
    if gcs >= length * 0.5:
        alphabet = 'at'
    # consider the maximum number of Gs and Cs we can have in the end
    # if we add one more A/T now
    elif length - len(start) - 1 + gcs < length * 0.4:
        alphabet = 'gc'
    for c in alphabet:
        if start.endswith(c * 3):
            continue
        for string in generate(start + c, alphabet):
            yield string
def brute_force():
    """ Straightforward method for comparison """
    lower = length * 0.4
    upper = length * 0.5
    for s in product('atgc', repeat=length):
        if lower <= s.count('g') + s.count('c') <= upper:
            s = ''.join(s)
            if not ('aaaa' in s or
                    'tttt' in s or
                    'cccc' in s or
                    'gggg' in s):
                yield s
def main():
    funcs = [
        lambda: generate('', 'atgc'),
        brute_force
    ]
    # Testing performance
    for func in funcs:
        # This needs to be big to get an accurate measure,
        # otherwise `brute_force` seems slower than it really is.
        # This is probably because of how `itertools.product`
        # is implemented.
        count = 100000000
        start = time()
        for _ in islice(func(), count):
            pass
        print(time() - start)
    # Testing correctness
    global length
    length = 12
    for x, y in zip(*[func() for func in funcs]):
        assert x == y, (x, y)
main()
On my machine, generate was just a bit faster than brute_force, at about 390 seconds vs 425. This was pretty much as fast as I could make them. I think the full thing would take about 2 hours. Of course, actually processing them will take much longer. The problem is that your constraints don't reduce the full set much.
Here's an example of how to use this in parallel across 16 processes:
from multiprocessing.pool import Pool
alpha = 'atgc'
def generate_worker(start):
    start = ''.join(start)
    for s in generate(start, alpha):
        print(s)
Pool(16).map(generate_worker, product(alpha, repeat=2))

Since you happen to have an alphabet of length 4 (or any "power of 2" integer), the idea of using an integer ID and bit-wise operations comes to mind, instead of checking for consecutive characters in strings. We can assign an integer value to each of the characters in the alphabet; for simplicity, let's use the index corresponding to each letter.
Example:
65463543 (base 10) = 3321232103313 (base 4) = 'aaaddcbcdcbaddbd'
The following function converts from a base 10 integer to a word using alphabet.
def id_to_word(word_id, word_len):
    word = ''
    while word_id:
        rem = word_id & 0x3 # 2 bits per letter
        word = ALPHABET[rem] + word
        word_id >>= 2 # Bit shift to the next letter
    return '{2:{0}>{1}}'.format(ALPHABET[0], word_len, word)
Now for a function to check whether a word is "good" based on its integer ID. The following method is of a similar format to id_to_word, except a counter is used to keep track of consecutive characters. The function will return False if the maximum number of identical consecutive characters is exceeded, otherwise it returns True.
def check_word(word_id, max_consecutive):
    consecutive = 0
    previous = None
    while word_id:
        rem = word_id & 0x3
        if rem != previous:
            consecutive = 0
        consecutive += 1
        if consecutive == max_consecutive + 1:
            return False
        word_id >>= 2
        previous = rem
    return True
We're effectively treating each word as a base-4 integer. If the alphabet length were not a "power of 2" value, then modulo (% alpha_len) and integer division (// alpha_len) could be used in place of the bit mask (& (alpha_len - 1)) and the shift (>> log2(alpha_len)) respectively, although it would take much longer.
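As a rough illustration of that modulo/integer-division variant, here is a sketch that works for an alphabet of any length. The function name id_to_word_general and the explicit alphabet parameter are hypothetical, and I have not benchmarked it against the bit-wise version.

```python
def id_to_word_general(word_id, word_len, alphabet):
    """Convert a non-negative integer ID to a word over alphabet.

    The ID is read as a number in base len(alphabet); each digit indexes
    one letter. The result is left-padded with alphabet[0] to word_len.
    """
    base = len(alphabet)
    letters = []
    while word_id:
        # divmod gives the next ID and the lowest "digit" in one step
        word_id, rem = divmod(word_id, base)
        letters.append(alphabet[rem])
    # Digits were produced lowest first, so reverse before padding.
    word = ''.join(reversed(letters))
    return word.rjust(word_len, alphabet[0])
```

For example, id_to_word_general(65463543, 16, 'abcd') gives 'aaaddcbcdcbaddbd', matching the example earlier in this answer.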
Finally, finding all the good words for a given word_len. The advantage of using a range of integer values is that you can reduce the number of for-loops in your code from word_len to 2, albeit the outer loop is very large. This may allow for more friendly multiprocessing of your good word finding task. I have also added in a quick calculation to determine the smallest and largest IDs corresponding to good words, which helps significantly narrow down the search for good words
ALPHABET = ('a', 'b', 'c', 'd')
def find_good_words(word_len):
    max_consecutive = 3
    alpha_len = len(ALPHABET)
    # Determine the words corresponding to the smallest and largest ids
    smallest_word = '' # aaabaaabaaabaaab
    largest_word = '' # dddcdddcdddcdddc
    for i in range(word_len):
        if (i + 1) % (max_consecutive + 1):
            smallest_word = ALPHABET[0] + smallest_word
            largest_word = ALPHABET[-1] + largest_word
        else:
            smallest_word = ALPHABET[1] + smallest_word
            largest_word = ALPHABET[-2] + largest_word
    # Determine the integer ids of said words
    trans_table = str.maketrans({c: str(i) for i, c in enumerate(ALPHABET)})
    smallest_id = int(smallest_word.translate(trans_table), alpha_len) # 1077952576
    largest_id = int(largest_word.translate(trans_table), alpha_len) # 3217014720
    # Find and store the id's of "good" words
    counter = 0
    goodies = []
    for i in range(smallest_id, largest_id + 1):
        if check_word(i, max_consecutive):
            goodies.append(i)
            counter += 1
In this loop I have specifically stored the word's ID as opposed to the actual word itself, in case you are going to use the words for further processing. However, if you are just after the words, then change the second to last line to read goodies.append(id_to_word(i, word_len)).
NOTE: I receive a MemoryError when attempting to store all good IDs for word_len >= 14. I suggest writing these IDs/words to a file of some sort!

Generate tailor made thousand delimiter

I want to tailor-make a thousand delimiter in Python. I am generating HTML and want to use &nbsp; as the thousands separator. (It would look like: 1 000 000)
So far I have found the following way to add a , as a separator:
>>> '{0:,}'.format(1000000)
'1,000,000'
But I don't seem to be able to use a similar construction to get another delimiter. '{0:|}'.format(1000000), for example, does not work. Is there an easy way to use anything (e.g., &nbsp;) as a thousands separator?

Well, you can always do this:
'{0:,}'.format(1000000).replace(',', '|')
Result: '1|000|000'
Here's a simple algorithm for the same. The previous version of it (ThSep two revisions back) didn't handle long separators like &nbsp;:
def ThSep(num, sep = ','):
    num = int(num)
    if not num:
        return '0'
    ret = ''
    dig = 0
    neg = False
    if num < 0:
        num = -num
        neg = True
    while num != 0:
        dig += 1
        ret += str(num % 10)
        # // keeps this integer division on both Python 2 and Python 3
        if (dig == 3) and (num // 10):
            for ch in reversed(sep):
                ret += ch
            dig = 0
        num //= 10
    if neg:
        ret += '-'
    return ''.join(reversed(ret))
Call it with ThSep(1000000, '&nbsp;') or ThSep(1000000, '|') to get the result you want.
It's about 4 times slower than the first method, though, so you can try rewriting this as a C extension for production code. This is only if speed matters much. I converted 2 000 000 negative and positive numbers in half a minute for the test.

There is no built-in way to do this, but you can use str.replace if the number is the only value present:
>>> '{0:,}'.format(1000000).replace(',','|')
'1|000|000'
This is mentioned in PEP 378
The proposal works well with floats, ints, and decimals. It also allows easy substitution for other separators. For example:
format(n, "6,d").replace(",", "_")
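Applied to the HTML case from the question, the PEP 378 substitution could be wrapped up like this. It is just a small sketch: the helper name format_thousands is made up for illustration, and it assumes n is an integer.

```python
def format_thousands(n, sep='&nbsp;'):
    """Format the integer n with sep between groups of three digits.

    Lets the built-in ',' format do the grouping, then swaps the comma
    for the desired separator, as suggested in PEP 378.
    """
    return format(n, ',d').replace(',', sep)
```

Because the comma only ever appears as the group separator in the formatted integer, the replacement cannot touch anything else. For example, format_thousands(1000000) gives '1&nbsp;000&nbsp;000' and format_thousands(1000000, '|') gives '1|000|000'.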

Finding the length of longest repeating?

I have tried plenty of different methods to achieve this, and I don't know what I'm doing wrong.
reps=[]
len_charac=0
def longest_charac(strng)
for i in range(len(strng)):
if strng[i] == strng[i+1]:
if strng[i] in reps:
reps.append(strng[i])
len_charac=len(reps)
return len_charac

Remember in Python counting loops and indexing strings aren't usually needed. There is also a builtin max function:
def longest(s):
    maximum = count = 0
    current = ''
    for c in s:
        if c == current:
            count += 1
        else:
            count = 1
            current = c
        maximum = max(count,maximum)
    return maximum
Output:
>>> longest('')
0
>>> longest('aab')
2
>>> longest('a')
1
>>> longest('abb')
2
>>> longest('aabccdddeffh')
3
>>> longest('aaabcaaddddefgh')
4

Simple solution:
def longest_substring(strng):
    len_substring=0
    longest=0
    for i in range(len(strng)):
        if i > 0:
            if strng[i] != strng[i-1]:
                len_substring = 0
        len_substring += 1
        if len_substring > longest:
            longest = len_substring
    return longest
Iterates through the characters in the string and checks against the previous one. If they are different then the count of repeating characters is reset to zero, then the count is incremented. If the current count beats the current record (stored in longest) then it becomes the new longest.

Compare two things and there is one relation between them:
'a' == 'a'
True
Compare three things, and there are two relations:
'a' == 'a' == 'b'
True False
Combine these ideas - repeatedly compare things with the things next to them, and the chain gets shorter each time:
'a' == 'a' == 'b'
True == False
False
It takes one reduction for the 'b' comparison to be False, because there was one 'b'; two reductions for the 'a' comparison to be False, because there were two 'a'. Keep repeating until the relations are all False, and that is how many consecutive equal characters there were.
def f(s):
    repetitions = 0
    while any(s):
        repetitions += 1
        s = [ s[i] and s[i] == s[i+1] for i in range(len(s)-1) ]
    return repetitions
>>> f('aaabcaaddddefgh')
4
NB. matching characters at the start become True, only care about comparing the Trues with anything, and stop when all the Trues are gone and the list is all Falses.
It can also be squished into a recursive version, passing the depth in as an optional parameter:
def f(s, depth=1):
    s = [ s[i] and s[i]==s[i+1] for i in range(len(s)-1) ]
    return f(s, depth+1) if any(s) else depth
>>> f('aaabcaaddddefgh')
4
I stumbled on this while trying for something else, but it's quite pleasing.

You can use itertools.groupby to solve this pretty quickly, it will group characters together, and then you can sort the resulting list by length and get the last entry in the list as follows:
from itertools import groupby
print(sorted([list(g) for k, g in groupby('aaabcaaddddefgh')],key=len)[-1])
This should give you:
['d', 'd', 'd', 'd']
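If you only need the length rather than the run itself, the same groupby idea can skip the sort and take the maximum group size directly. A small sketch (the function name longest_run is mine):

```python
from itertools import groupby

def longest_run(s):
    """Length of the longest run of identical consecutive characters in s."""
    # groupby yields one group per run; count each group's items lazily.
    # default=0 covers the empty string, which has no groups at all.
    return max((sum(1 for _ in group) for _, group in groupby(s)), default=0)
```

This makes a single pass over the string and avoids building and sorting a list of lists.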

This works:
def longestRun(s):
    if len(s) == 0: return 0
    runs = ''.join('*' if x == y else ' ' for x,y in zip(s,s[1:]))
    starStrings = runs.split()
    if len(starStrings) == 0: return 1
    return 1 + max(len(stars) for stars in starStrings)
Output:
>>> longestRun("aaabcaaddddefgh")
4

First off, Python is not my primary language, but I can still try to help.
1) It looks like you are exceeding the bounds of the string. On the last iteration, you check the last character against the character beyond the last character. In Python this raises an IndexError.
2) you start off with an empty reps[] array and compare every character to see if it's in it. Clearly, that check will fail every time and your append is within that if statement.

def longest_charac(string):
    longest = 0
    if string:
        flag = string[0]
        tmp_len = 0
        for item in string:
            if item == flag:
                tmp_len += 1
            else:
                flag = item
                tmp_len = 1
            if tmp_len > longest:
                longest = tmp_len
    return longest
This is my solution. Maybe it will help you.

Just for context, here is a recursive approach that avoids dealing with loops:
def max_rep(prev, text, reps, rep=1):
    """Recursively consume all characters in text and find longest repetition.
    Args
        prev: string of previous character
        text: string of remaining text
        reps: list of ints of all repetitions observed
        rep: int of current repetition observed
    """
    if text == '': return max(reps)
    if prev == text[0]:
        rep += 1
    else:
        rep = 1
    return max_rep(text[0], text[1:], reps + [rep], rep)
Tests:
>>> max_rep('', 'aaabcaaddddefgh', [])
4
>>> max_rep('', 'aaaaaabcaadddddefggghhhhhhh', [])
7
