Testing whether a tuple has all distinct elements - python

I was seeking for a way to test whether a tuple has all distinct elements - so to say, it is a set and ended up with this quick and dirty solution.
def distinct ( tup):
n=0
for t in tup:
for k in tup:
#print t,k,n
if (t == k ):
n = n+1
if ( n != len(tup)):
return False
else :
return True
print distinct((1,3,2,10))
print distinct((3,3,4,2,7))
Any thinking error? Is there a builtin on tuples?

You can very easily do it as:
len(set(tup))==len(tup)
This creates a set of tup and checks if it is the same length as the original tup. The only case in which they would have the same length is if all elements in tup were unique
Examples
>>> a = (1,2,3)
>>> print len(set(a))==len(a)
True
>>> b = (1,2,2)
>>> print len(set(b))==len(b)
False
>>> c = (1,2,3,4,5,6,7,8,5)
>>> print len(set(c))==len(c)
False

In the majority of the cases, where all of the items in the tuple are hashable and support comparison (using == operator) with each other, #sshashank124's solution is what you're after:
len(set(tup))==len(tup)
For the example you posted, i.e. a tuple of ints, that would do.
Else, if the items are not hashable, but do have order defined on them (support '==', '<' operators, etc.), the best you can do is sorting them (O(NlogN) worst case), and then look for adjacent equal elements:
sorted_tup = sorted(tup)
all( x!=y for x,y in zip(sorted_tup[:-1],sorted_tup[1:]) )
Else, if the items only support equality comparison (==), the best you can do is the O(N^2) worst case algorithm, i.e. comparing every item to every other.
I would implement it this way, using itertools.combinations:
def distinct(tup):
for x,y in itertools.combinations(tup, 2):
if x == y:
return False
return True
Or as a one-liner:
all( x!=y for x,y in itertools.combinations(tup, 2) )

What about using early_exit:
This will be faster:
def distinct(tup):
s = set()
for x in tup:
if x in s:
return False
s.add(x)
return True
Ok: Here is the test and profiling with 1000 int:
#!/usr/bin/python
def distinct1(tup):
n=0
for t in tup:
for k in tup:
if (t == k ):
n = n+1
if ( n != len(tup)):
return False
else :
return True
def distinct2(tup):
return len(set(tup))==len(tup)
def distinct3(tup):
s = set()
for x in tup:
if x in s:
return False
s.add(x)
return True
import cProfile
from faker import Faker
from timeit import Timer
s = Faker()
func = [ distinct1, distinct2, distinct3 ]
tuples = tuple([ s.random_int(min=1, max=9999) for x in range(1000) ])
for fun in func:
t = Timer(lambda: fun(tuples))
print fun.__name__, cProfile.run('t.timeit(number=1000)')
Output: distinct 2 and distinct3 are almost the same.
distinct1 3011 function calls in 60.289 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 60.289 60.289 <string>:1(<module>)
1000 60.287 0.060 60.287 0.060 compare_tuple.py:3(distinct1)
1000 0.001 0.000 60.288 0.060 compare_tuple.py:34(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 60.289 60.289 timeit.py:178(timeit)
1 0.001 0.001 60.289 60.289 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1000 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
distinct2 4011 function calls in 0.053 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.053 0.053 <string>:1(<module>)
1000 0.052 0.000 0.052 0.000 compare_tuple.py:14(distinct2)
1000 0.000 0.000 0.053 0.000 compare_tuple.py:34(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 0.053 0.053 timeit.py:178(timeit)
1 0.000 0.000 0.053 0.053 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
2000 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
distinct3 183011 function calls in 0.072 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.072 0.072 <string>:1(<module>)
1000 0.051 0.000 0.070 0.000 compare_tuple.py:17(distinct3)
1000 0.002 0.000 0.072 0.000 compare_tuple.py:34(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 0.072 0.072 timeit.py:178(timeit)
1 0.000 0.000 0.072 0.072 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
181000 0.019 0.000 0.019 0.000 {method 'add' of 'set' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None

a = (1,2,3,4,5,6,7,8,9)
b = (1,2,3,4,5,6,6,7,8,9)
def unique(tup):
i = 0
while i < len(tup):
if tup[i] in tup[i+1:]:
return False
i = i + 1
return True
unique(a)
True
unique(b)
False
Readable and works with hashable items. People seem adverse to "while". This also allows for "special cases" and filtering on attributes.

Related

Leetcode 1155. dictionary based memoization vs LRU Cache

I was solving leetcode 1155 which is about number of dice rolls with target sum. I was using dictionary-based memorization. Here's the exact code:
class Solution:
def numRollsToTarget(self, dices: int, faces: int, target: int) -> int:
dp = {}
def ways(t, rd):
if t == 0 and rd == 0: return 1
if t <= 0 or rd <= 0: return 0
if dp.get((t,rd)): return dp[(t,rd)]
dp[(t,rd)] = sum(ways(t-i, rd-1) for i in range(1,faces+1))
return dp[(t,rd)]
return ways(target, dices)
But this solution is invariably timing out for a combination of face and dices around 15*15
Then I found this solution which uses functools.lru_cache and the rest of it is exactly the same. This solution works very fast.
class Solution:
def numRollsToTarget(self, dices: int, faces: int, target: int) -> int:
from functools import lru_cache
#lru_cache(None)
def ways(t, rd):
if t == 0 and rd == 0: return 1
if t <= 0 or rd <= 0: return 0
return sum(ways(t-i, rd-1) for i in range(1,faces+1))
return ways(target, dices)
Earlier, I have compared and found that in most cases, lru_cache does not outperform dictionary-based cache by such a margin.
Can someone explain the reason why there is such a drastic performance difference between the two approaches?
First, running your OP code with cProfile and this is the report:
with print(numRollsToTarget2(4, 6, 20)) (OP version)
You can spot right away there're some heavy calls in ways genexpr and sum. That's prob. need close examinations and try to improve/reduce. Next posting is for similar memo version, but the calls is much less. And that version has passed w/o timeout.
35
2864 function calls (366 primitive calls) in 0.018 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.018 0.018 <string>:1(<module>)
1 0.000 0.000 0.001 0.001 dice_rolls.py:23(numRollsToTarget2)
1075/1 0.001 0.000 0.001 0.001 dice_rolls.py:25(ways)
1253/7 0.001 0.000 0.001 0.000 dice_rolls.py:30(<genexpr>)
1 0.000 0.000 0.018 0.018 dice_rolls.py:36(main)
21 0.000 0.000 0.000 0.000 rpc.py:153(debug)
3 0.000 0.000 0.017 0.006 rpc.py:216(remotecall)
3 0.000 0.000 0.000 0.000 rpc.py:226(asynccall)
3 0.000 0.000 0.016 0.005 rpc.py:246(asyncreturn)
3 0.000 0.000 0.000 0.000 rpc.py:252(decoderesponse)
3 0.000 0.000 0.016 0.005 rpc.py:290(getresponse)
3 0.000 0.000 0.000 0.000 rpc.py:298(_proxify)
3 0.000 0.000 0.016 0.005 rpc.py:306(_getresponse)
3 0.000 0.000 0.000 0.000 rpc.py:328(newseq)
3 0.000 0.000 0.000 0.000 rpc.py:332(putmessage)
2 0.000 0.000 0.001 0.000 rpc.py:559(__getattr__)
3 0.000 0.000 0.000 0.000 rpc.py:57(dumps)
1 0.000 0.000 0.001 0.001 rpc.py:577(__getmethods)
2 0.000 0.000 0.000 0.000 rpc.py:601(__init__)
2 0.000 0.000 0.016 0.008 rpc.py:606(__call__)
4 0.000 0.000 0.000 0.000 run.py:412(encoding)
4 0.000 0.000 0.000 0.000 run.py:416(errors)
2 0.000 0.000 0.017 0.008 run.py:433(write)
6 0.000 0.000 0.000 0.000 threading.py:1306(current_thread)
3 0.000 0.000 0.000 0.000 threading.py:222(__init__)
3 0.000 0.000 0.016 0.005 threading.py:270(wait)
3 0.000 0.000 0.000 0.000 threading.py:81(RLock)
3 0.000 0.000 0.000 0.000 {built-in method _struct.pack}
3 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}
6 0.000 0.000 0.000 0.000 {built-in method _thread.get_ident}
1 0.000 0.000 0.018 0.018 {built-in method builtins.exec}
6 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
9 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.017 0.017 {built-in method builtins.print}
179/1 0.000 0.000 0.001 0.001 {built-in method builtins.sum}
3 0.000 0.000 0.000 0.000 {built-in method select.select}
3 0.000 0.000 0.000 0.000 {method '_acquire_restore' of '_thread.RLock' objects}
3 0.000 0.000 0.000 0.000 {method '_is_owned' of '_thread.RLock' objects}
3 0.000 0.000 0.000 0.000 {method '_release_save' of '_thread.RLock' objects}
3 0.000 0.000 0.000 0.000 {method 'acquire' of '_thread.RLock' objects}
6 0.016 0.003 0.016 0.003 {method 'acquire' of '_thread.lock' objects}
3 0.000 0.000 0.000 0.000 {method 'append' of 'collections.deque' objects}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'bytes' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
3 0.000 0.000 0.000 0.000 {method 'dump' of '_pickle.Pickler' objects}
2 0.000 0.000 0.000 0.000 {method 'encode' of 'str' objects}
201 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
3 0.000 0.000 0.000 0.000 {method 'getvalue' of '_io.BytesIO' objects}
3 0.000 0.000 0.000 0.000 {method 'release' of '_thread.RLock' objects}
3 0.000 0.000 0.000 0.000 {method 'send' of '_socket.socket' objects}
Then I tried to run modified/simplified version, and compare the results.
35
387 function calls (193 primitive calls) in 0.006 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.006 0.006 <string>:1(<module>)
1 0.000 0.000 0.006 0.006 dice_rolls.py:36(main)
1 0.000 0.000 0.000 0.000 dice_rolls.py:5(numRollsToTarget)
195/1 0.000 0.000 0.000 0.000 dice_rolls.py:8(dp)
21 0.000 0.000 0.000 0.000 rpc.py:153(debug)
3 0.000 0.000 0.006 0.002 rpc.py:216(remotecall)
3 0.000 0.000 0.000 0.000 rpc.py:226(asynccall)
3 0.000 0.000 0.006 0.002 rpc.py:246(asyncreturn)
3 0.000 0.000 0.000 0.000 rpc.py:252(decoderesponse)
3 0.000 0.000 0.006 0.002 rpc.py:290(getresponse)
3 0.000 0.000 0.000 0.000 rpc.py:298(_proxify)
3 0.000 0.000 0.006 0.002 rpc.py:306(_getresponse)
3 0.000 0.000 0.000 0.000 rpc.py:328(newseq)
3 0.000 0.000 0.000 0.000 rpc.py:332(putmessage)
2 0.000 0.000 0.001 0.000 rpc.py:559(__getattr__)
3 0.000 0.000 0.000 0.000 rpc.py:57(dumps)
1 0.000 0.000 0.001 0.001 rpc.py:577(__getmethods)
2 0.000 0.000 0.000 0.000 rpc.py:601(__init__)
2 0.000 0.000 0.005 0.003 rpc.py:606(__call__)
4 0.000 0.000 0.000 0.000 run.py:412(encoding)
4 0.000 0.000 0.000 0.000 run.py:416(errors)
2 0.000 0.000 0.006 0.003 run.py:433(write)
6 0.000 0.000 0.000 0.000 threading.py:1306(current_thread)
3 0.000 0.000 0.000 0.000 threading.py:222(__init__)
3 0.000 0.000 0.006 0.002 threading.py:270(wait)
3 0.000 0.000 0.000 0.000 threading.py:81(RLock)
3 0.000 0.000 0.000 0.000 {built-in method _struct.pack}
3 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}
6 0.000 0.000 0.000 0.000 {built-in method _thread.get_ident}
1 0.000 0.000 0.006 0.006 {built-in method builtins.exec}
6 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
9 0.000 0.000 0.000 0.000 {built-in method builtins.len}
34 0.000 0.000 0.000 0.000 {built-in method builtins.max}
1 0.000 0.000 0.006 0.006 {built-in method builtins.print}
3 0.000 0.000 0.000 0.000 {built-in method select.select}
3 0.000 0.000 0.000 0.000 {method '_acquire_restore' of '_thread.RLock' objects}
3 0.000 0.000 0.000 0.000 {method '_is_owned' of '_thread.RLock' objects}
3 0.000 0.000 0.000 0.000 {method '_release_save' of '_thread.RLock' objects}
3 0.000 0.000 0.000 0.000 {method 'acquire' of '_thread.RLock' objects}
6 0.006 0.001 0.006 0.001 {method 'acquire' of '_thread.lock' objects}
3 0.000 0.000 0.000 0.000 {method 'append' of 'collections.deque' objects}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'bytes' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
3 0.000 0.000 0.000 0.000 {method 'dump' of '_pickle.Pickler' objects}
2 0.000 0.000 0.000 0.000 {method 'encode' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
3 0.000 0.000 0.000 0.000 {method 'getvalue' of '_io.BytesIO' objects}
3 0.000 0.000 0.000 0.000 {method 'release' of '_thread.RLock' objects}
3 0.000 0.000 0.000 0.000 {method 'send' of '_socket.socket' objects}
The profiling codes are here:
import cProfile
from typing import List
def numRollsToTarget(d, f, target):
memo = {}
def dp(d, target):
if d == 0:
return 0 if target > 0 else 1
if (d, target) in memo:
return memo[(d, target)]
result = 0
for k in range(max(0, target-f), target):
result += dp(d-1, k)
memo[(d, target)] = result
return result
return dp(d, target) % (10**9 + 7)
def numRollsToTarget2(dices: int, faces: int, target: int) -> int:
dp = {}
def ways(t, rd):
if t == 0 and rd == 0: return 1
if t <= 0 or rd <= 0: return 0
if dp.get((t,rd)): return dp[(t,rd)]
dp[(t,rd)] = sum(ways(t-i, rd-1) for i in range(1,faces+1))
return dp[(t,rd)]
return ways(target, dices)
def numRollsToTarget3(dices: int, faces: int, target: int) -> int:
from functools import lru_cache
#lru_cache(None)
def ways(t, rd):
if t == 0 and rd == 0: return 1
if t <= 0 or rd <= 0: return 0
return sum(ways(t-i, rd-1) for i in range(1,faces+1))
return ways(target, dices)
def main():
print(numRollsToTarget(4, 6, 20))
#print(numRollsToTarget2(4, 6, 20))
#print(numRollsToTarget3(4, 6, 20)) # not faster than first
if __name__ == '__main__':
cProfile.run('main()')

Optimization of N digits all permutations sum algorithm

[Note: Programming Challenge]
Challenge description:
Input: List of Digits 1, 5, 6, 2, 7, 4
Non-repeated and no zeros(digits 1-9 are all valid). The goal is to create all permutations and sum them all. I used itertools permutations for this. I was able to do some test assertions with 7,8, and 9digits on my local machine. At 9 digits the time increases quite a bit to where the server times out at 12seconds, but my computer doesn't take that long. It just says to optmize my algorithm.
Below is an example of how arrays are formed, and itertools does this in each iteration of the inner loop.
array for the list sum (terms of arrays)
[1] 1 # arrays with only one term (7 different arrays)
[5] 5
[6] 6
[2] 2
[7] 7
[3] 3
[4] 4
[1, 5] 6 # arrays with two terms (42 different arrays)
[1, 6] 7
[1, 2] 3
[1, 7] 8
[1, 3] 4
[1, 4] 5
..... ...
[1, 5, 6] 12 # arrays with three terms(210 different arrays)
[1, 5, 2] 8
[1, 5, 7] 13
[1, 5, 3] 9
........ ...
This is what I have so far, and it gets the job done but my main bottleneck is the sum, from 8 digits to 9 digits it increases exponentially. Below function passed two tests but on the server it times out.
for i in range(1, limit + 1):
for j in permutations(digits, i):
grandSum =+ grandSum + sum(j)
And below are the results from cProfile runs, 7, 8, 9:
13778 function calls in 0.011 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.011 0.011 <string>:1(<module>)
7 0.000 0.000 0.000 0.000 GreatTotalAdditions.py:25(removefirstindex)
1 0.006 0.006 0.011 0.011 GreatTotalAdditions.py:31(gta)
1 0.000 0.000 0.011 0.011 {built-in method builtins.exec}
21 0.000 0.000 0.000 0.000 {built-in method builtins.len}
8 0.000 0.000 0.000 0.000 {built-in method builtins.print}
13699 0.004 0.000 0.004 0.000 {built-in method builtins.sum}
25 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
7 0.000 0.000 0.000 0.000 {method 'insert' of 'list' objects}
7 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
109701 function calls in 0.086 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.085 0.085 <string>:1(<module>)
11 0.000 0.000 0.000 0.000 GreatTotalAdditions.py:25(removefirstindex)
1 0.047 0.047 0.085 0.085 GreatTotalAdditions.py:31(gta)
1 0.000 0.000 0.086 0.086 {built-in method builtins.exec}
27 0.000 0.000 0.000 0.000 {built-in method builtins.len}
9 0.000 0.000 0.000 0.000 {built-in method builtins.print}
109600 0.038 0.000 0.038 0.000 {built-in method builtins.sum}
28 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
11 0.000 0.000 0.000 0.000 {method 'insert' of 'list' objects}
11 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
986553 function calls in 0.701 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.701 0.701 <string>:1(<module>)
16 0.000 0.000 0.000 0.000 GreatTotalAdditions.py:25(removefirstindex)
1 0.374 0.374 0.701 0.701 GreatTotalAdditions.py:31(gta)
1 0.000 0.000 0.701 0.701 {built-in method builtins.exec}
34 0.000 0.000 0.000 0.000 {built-in method builtins.len}
10 0.000 0.000 0.000 0.000 {built-in method builtins.print}
986409 0.327 0.000 0.327 0.000 {built-in method builtins.sum}
48 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
16 0.000 0.000 0.000 0.000 {method 'insert' of 'list' objects}
16 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
Below is where I would like some help:
Knowing the sum is taking most of the computation time. I came up across this math formula, example for a 3 digit number:
(3-1)! * (1 + 2 + 3) * (111)
And my implementation is below, bare in mind I'm new to python:
def sum_them_quick(digits):
ones = ""
digitsSum = 0
# if 3 digits then form 111
for i in range(len(digits)):
ones += "1"
for j in digits:
digitsSum = digitsSum + j
return factorial(len(digits)-1) * (digitsSum) * (int(ones))
However the results from this last function are wrong, and doesn't correspond with good results from passed tests. I'm not entirely sure if the number of permutations is different than itertools permutations method, but I'll confirm tomorrow for sure.

Best performance method for evaluating presence of substrings within strings in python?

I know I can do:
word = 'shrimp'
sentence = 'They eat shrimp in Mexico, and even more these days'
word in sentence
And so it'd would evaluate True.
But what if I have:
words = ['a','la','los','shrimp']
How can I evaluate whether sentence contains any words elements?
I care for performance as list might be large and I don't want to extend code lines using loops
Could be read as English. But might not be the most performant. First benchmark and then only if needed, use an optimized solution.
any(word in sentence for word in words)
any returns True if any element is True, otherwise False. (word in sentence for word in words) is an iterable, which produces whether each word is in the sentence or not.
>>> words = ['a','la','los','shrimp']
>>> sentence = 'They eat shrimp in Mexico, and more these days'
>>> [word for word in words if word in sentence]
['a', 'shrimp']
>>> any(word in sentence for word in words)
True
words = ['a','la','los','shrimp']
sentence = 'They eat shrimp in Mexico, and more these days'
a = map(lambda word: word in words if word in sentence else None, words)
print(a.__next__())
True
There are many ways to perform this.
Since you asked for Performance, here are the results.
In [1]: words = ['a','la','los','shrimp']
In [2]: sentence = 'They eat shrimp in Mexico
In [3]: a = map(lambda word: word in words if word in sentence else None, words)
In [4]: a.__next__()
Out[4]: True
In [5]: %timeit a
10000000 loops, best of 3: 65.3 ns per loop
In [6]: b = any(word in sentence for word in words)
In [7]: %timeit b
10000000 loops, best of 3: 50.1 ns per loop
words = ['a','la','fresh','shrimp','ban']
sentence = 'They fresh bananas in Mexico, and more these days'
a = [word for word in words if word in sentence.split(' ')]
print(a) #['fresh']
b = any(word in sentence.split(' ') for word in words)
print(b) #True
for word in words:
if word in sentence:
#do stuff
Here is what i Found.
Testing is done on 100k words:
NB: Generating sets is expensive inside the function if your list is long, BUT if you can pre-process into set before calling the function then, set will win.
Snippet:
#!/usr/bin/python
import cProfile
from timeit import Timer
from faker import Faker
def func1(sentence, words):
return any(word in sentence for word in words)
def func2(sentence, words):
for word in words:
if word in sentence:
return True
return False
def func3(sentence, words):
# using set
sets = set(sentence)
return bool(set(words).intersection(sets))
def func4(sentence, words):
# using set
sets = set(sentence)
for word in words:
if word in sets:
return True
return False
def func5(sentence, words):
# using set
sentence, words = set(sentence), set(words)
return not words.isdisjoint(sentence)
s = Faker()
sentence = s.sentence(nb_words=100000).split()
words = 'quidem necessitatibus minus id quos in neque omnis molestias'.split()
func = [ func1, func2, func3, func4, func5 ]
for fun in func:
t = Timer(lambda: fun(sentence, words))
print fun.__name__, cProfile.run('t.timeit(number=1000)')
Output:
Result: func2 wins
func1 5011 function calls in 0.032 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.032 0.032 <string>:1(<module>)
1000 0.000 0.000 0.032 0.000 exists.py:42(<lambda>)
1000 0.001 0.000 0.032 0.000 exists.py:8(func1)
2000 0.030 0.000 0.030 0.000 exists.py:9(<genexpr>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 0.032 0.032 timeit.py:178(timeit)
1 0.000 0.000 0.032 0.032 timeit.py:96(inner)
1000 0.000 0.000 0.030 0.000 {any}
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
func2 2011 function calls in 0.031 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.031 0.031 <string>:1(<module>)
1000 0.031 0.000 0.031 0.000 exists.py:11(func2)
1000 0.000 0.000 0.031 0.000 exists.py:42(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 0.031 0.031 timeit.py:178(timeit)
1 0.000 0.000 0.031 0.031 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
func3 3011 function calls in 7.079 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 7.079 7.079 <string>:1(<module>)
1000 7.069 0.007 7.073 0.007 exists.py:17(func3)
1000 0.004 0.000 7.077 0.007 exists.py:42(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 7.079 7.079 timeit.py:178(timeit)
1 0.002 0.002 7.079 7.079 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1000 0.004 0.000 0.004 0.000 {method 'intersection' of 'set' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
func4 2011 function calls in 7.022 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 7.022 7.022 <string>:1(<module>)
1000 7.014 0.007 7.014 0.007 exists.py:22(func4)
1000 0.006 0.000 7.020 0.007 exists.py:42(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 7.022 7.022 timeit.py:178(timeit)
1 0.002 0.002 7.022 7.022 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
func5 3011 function calls in 7.142 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 7.142 7.142 <string>:1(<module>)
1000 7.133 0.007 7.134 0.007 exists.py:30(func5)
1000 0.006 0.000 7.140 0.007 exists.py:42(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 7.142 7.142 timeit.py:178(timeit)
1 0.002 0.002 7.142 7.142 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1000 0.002 0.000 0.002 0.000 {method 'isdisjoint' of 'set' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None

Characters in Strings in Python

Implement a function with signature find_chars(string1, string2) that
takes two strings and returns a string that contains only the
characters found in string1 and string two in the order that they are
found in string1. Implement a version of order N*N and one of order N.
(Source: http://thereq.com/q/138/python-software-interview-question/characters-in-strings)
Here are my solutions:
Order N*N:
def find_chars_slow(string1, string2):
res = []
for char in string1:
if char in string2:
res.append(char)
return ''.join(res)
So the for loop goes through N elements, and each char in string2 check is another N operations so this gives N*N.
Order N:
from collections import defaultdict
def find_char_fast(string1, string2):
d = defaultdict(int)
for char in string2:
d[char] += 1
res = []
for char in string1:
if char in d:
res.append(char)
return ''.join(res)
First store the characters of string2 as a dictionary (O(N)). Then scan string1 (O(N)) and check if it is in the dict (O(1)). This gives a total runtime of O(2N) = O(N).
Is the above correct? Is there a faster method?
Your solution is algorithmically correct (the first is O(n**2), and the second is O(n)) but you're doing some things that are going to be possible red flags to an interviewer.
The first function is basically okay. You might get bonus points for writing it like this:
''.join([c for c in string1 if c in string2])
..which does essentially the same thing.
My problem (if I'm wearing my interviewer pants) with how you've written the second function is that you use a defaultdict where you don't care at all about the count - you only care about membership. This is the classic case for when to use a set.
seen = set(string2)
''.join([c for c in string1 if c in seen])
The way I've written these functions are going to be slightly faster than what you wrote, since list comprehensions loop in native code rather than in python bytecode. They are algorithmically the same complexity.
Your method is sound, and there is no method with time complexity less than O(N), since you obviously need to go through each character at least once.
That is not to say that there's no method that runs faster. There's no need to actually increment the numbers in the dictionary. You could, for example, use a set. You could also make further use of python's features, such as list comprehensions/generators:
def find_char_fast2(string1, string2):
s= set(string2)
return "".join( (x for x in string1 if x in s) )
The algorithms you have used are perfectly fine. There are few improvements you can do
Since you are converting the second string to a dictionary, I would recommend using set, like this
d = set(string2)
Apart from that you can use list comprehension, as a filter, like this
return "".join([char for char in string1 if char in d])
If the order of the characters in the output doesn't matter, you can simply convert both the strings to sets and simply find set difference, like this
return "".join(set(string1) - set(string2))
Hi, I am trying to profiling the various solution given here:
In my snippet, I am using a module called faker to generate fake words so i can test on very long string more than 20k characters:
Snippet:
from faker import Faker
from timeit import Timer
from collections import defaultdict
def first(string1, string2):
sets = set(string2)
return ''.join((c for c in string1 if c in sets))
def second(s1, s2):
res = []
for char in string1:
if char in string2:
res.append(char)
return ''.join(res)
def third(s1, s2):
d = defaultdict(int)
for char in string2:
d[char] += 1
res = []
for char in string1:
if char in d:
res.append(char)
return ''.join(res)
f = Faker()
string1 = ''.join(f.paragraph(nb_sentences=10000).split())
string2 = ''.join(f.paragraph(nb_sentences=10000).split())
funcs = [first, second, third]
import cProfile
print 'Length of String1: ', len(string1)
print 'Length of String2: ', len(string2)
print 'Time taken to execute:'
for f in funcs:
t = Timer(lambda: f(string1, string2))
print f.__name__, cProfile.run('t.timeit(number=100)')
Output:
Length of String1: 525133
Length of String2: 501050
Time taken to execute:
first 52513711 function calls in 18.169 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
100 0.001 0.000 18.164 0.182 s.py:39(<lambda>)
100 1.723 0.017 18.163 0.182 s.py:5(first)
52513400 9.442 0.000 9.442 0.000 s.py:7(<genexpr>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 18.169 18.169 timeit.py:178(timeit)
1 0.005 0.005 18.169 18.169 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 6.998 0.070 16.440 0.164 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
second 52513611 function calls in 22.280 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 22.280 22.280 <string>:1(<module>)
100 0.121 0.001 22.275 0.223 s.py:39(<lambda>)
100 16.957 0.170 22.153 0.222 s.py:9(second)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 22.280 22.280 timeit.py:178(timeit)
1 0.005 0.005 22.280 22.280 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
52513300 4.018 0.000 4.018 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 1.179 0.012 1.179 0.012 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
third 52513611 function calls in 28.184 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 28.184 28.184 <string>:1(<module>)
100 22.847 0.228 28.059 0.281 s.py:16(third)
100 0.120 0.001 28.179 0.282 s.py:39(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 28.184 28.184 timeit.py:178(timeit)
1 0.005 0.005 28.184 28.184 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
52513300 4.032 0.000 4.032 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 1.180 0.012 1.180 0.012 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
Conclusion:
So, the first function with comprehension is the fastest.
But when you run on strings size around 25K characters second functions wins.
Length of String1: 22959
Length of String2: 452919
Time taken to execute:
first 2296311 function calls in 2.216 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
100 0.000 0.000 2.216 0.022 s.py:39(<lambda>)
100 1.530 0.015 2.216 0.022 s.py:5(first)
2296000 0.402 0.000 0.402 0.000 s.py:7(<genexpr>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 2.216 2.216 timeit.py:178(timeit)
1 0.000 0.000 2.216 2.216 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 0.284 0.003 0.686 0.007 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
second 2296211 function calls in 0.939 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.939 0.939 <string>:1(<module>)
100 0.003 0.000 0.939 0.009 s.py:39(<lambda>)
100 0.729 0.007 0.936 0.009 s.py:9(second)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 0.939 0.939 timeit.py:178(timeit)
1 0.000 0.000 0.939 0.939 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
2295900 0.165 0.000 0.165 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 0.042 0.000 0.042 0.000 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
third 2296211 function calls in 8.361 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 8.361 8.361 <string>:1(<module>)
100 8.145 0.081 8.357 0.084 s.py:16(third)
100 0.004 0.000 8.361 0.084 s.py:39(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 8.361 8.361 timeit.py:178(timeit)
1 0.000 0.000 8.361 8.361 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
2295900 0.169 0.000 0.169 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 0.043 0.000 0.043 0.000 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None

how to avoid substrings

I currently process sections of a string like this:
for (i, j) in huge_list_of_indices:
process(huge_text_block[i:j])
I want to avoid the overhead of generating these temporary substrings. Any ideas? Perhaps a wrapper that somehow uses index offsets? This is currently my bottleneck.
Note that process() is another python module that expects a string as input.
Edit:
A few people doubt there is a problem. Here are some sample results:
import time
import string
text = string.letters * 1000
def timeit(fn):
t1 = time.time()
for i in range(len(text)):
fn(i)
t2 = time.time()
print '%s took %0.3f ms' % (fn.func_name, (t2-t1) * 1000)
def test_1(i):
return text[i:]
def test_2(i):
return text[:]
def test_3(i):
return text
timeit(test_1)
timeit(test_2)
timeit(test_3)
Output:
test_1 took 972.046 ms
test_2 took 47.620 ms
test_3 took 43.457 ms
I think what you are looking for are buffers.
The characteristic of buffers is that they "slice" an object supporting the buffer interface without copying its content, but essentially opening a "window" on the sliced object content. Some more technical explanation is available here. An excerpt:
Python objects implemented in C can export a group of functions called the “buffer interface.” These functions can be used by an object to expose its data in a raw, byte-oriented format. Clients of the object can use the buffer interface to access the object data directly, without needing to copy it first.
In your case the code should look more or less like this:
>>> s = 'Hugely_long_string_not_to_be_copied'
>>> ij = [(0, 3), (6, 9), (12, 18)]
>>> for i, j in ij:
... print buffer(s, i, j-i) # Should become process(...)
Hug
_lo
string
HTH!
A wrapper that uses index offsets to a mmap object could work, yes.
But before you do that, are you sure that generating these substrings are a problem? Don't optimize before you have found out where the time and memory actually goes. I wouldn't expect this to be a significant problem.
If you are using Python3 you can use protocol buffer and memory views. Assuming that the text is stored somewhere in the filesystem:
f = open(FILENAME, 'rb')
data = bytearray(os.path.getsize(FILENAME))
f.readinto(data)
mv = memoryview(data)
for (i, j) in huge_list_of_indices:
process(mv[i:j])
Check also this article. It might be useful.
Maybe a wrapper that uses index offsets is indeed what you are looking for. Here is an example that does the job. You may have to add more checks on slices (for overflow and negative indexes) depending on your needs.
#!/usr/bin/env python
from collections import Sequence
from timeit import Timer
def process(s):
return s[0], len(s)
class FakeString(Sequence):
def __init__(self, string):
self._string = string
self.fake_start = 0
self.fake_stop = len(string)
def setFakeIndices(self, i, j):
self.fake_start = i
self.fake_stop = j
def __len__(self):
return self.fake_stop - self.fake_start
def __getitem__(self, ii):
if isinstance(ii, slice):
if ii.start is None:
start = self.fake_start
else:
start = ii.start + self.fake_start
if ii.stop is None:
stop = self.fake_stop
else:
stop = ii.stop + self.fake_start
ii = slice(start,
stop,
ii.step)
else:
ii = ii + self.fake_start
return self._string[ii]
def initial_method():
r = []
for n in xrange(1000):
r.append(process(huge_string[1:9999999]))
return r
def alternative_method():
r = []
for n in xrange(1000):
fake_string.setFakeIndices(1, 9999999)
r.append(process(fake_string))
return r
if __name__ == '__main__':
huge_string = 'ABCDEFGHIJ' * 100000
fake_string = FakeString(huge_string)
fake_string.setFakeIndices(5,15)
assert fake_string[:] == huge_string[5:15]
t = Timer(initial_method)
print "initial_method(): %fs" % t.timeit(number=1)
which gives:
initial_method(): 1.248001s
alternative_method(): 0.003416s
The example the OP gives, will give nearly biggest performance difference between slicing and not slicing possible.
If processing actually does something that takes significant time, the problem may hardly exist.
Fact is OP needs to let us know what process does. The most likely scenario is it does something significant, and therefore he should profile his code.
Adapted from op's example:
#slice_time.py
import time
import string
text = string.letters * 1000
import random
indices = range(len(text))
random.shuffle(indices)
import re
def greater_processing(a_string):
results = re.findall('m', a_string)
def medium_processing(a_string):
return re.search('m.*?m', a_string)
def lesser_processing(a_string):
return re.match('m', a_string)
def least_processing(a_string):
return a_string
def timeit(fn, processor):
t1 = time.time()
for i in indices:
fn(i, i + 1000, processor)
t2 = time.time()
print '%s took %0.3f ms %s' % (fn.func_name, (t2-t1) * 1000, processor.__name__)
def test_part_slice(i, j, processor):
return processor(text[i:j])
def test_copy(i, j, processor):
return processor(text[:])
def test_text(i, j, processor):
return processor(text)
def test_buffer(i, j, processor):
return processor(buffer(text, i, j - i))
if __name__ == '__main__':
processors = [least_processing, lesser_processing, medium_processing, greater_processing]
tests = [test_part_slice, test_copy, test_text, test_buffer]
for processor in processors:
for test in tests:
timeit(test, processor)
And then the run...
In [494]: run slice_time.py
test_part_slice took 68.264 ms least_processing
test_copy took 42.988 ms least_processing
test_text took 33.075 ms least_processing
test_buffer took 76.770 ms least_processing
test_part_slice took 270.038 ms lesser_processing
test_copy took 197.681 ms lesser_processing
test_text took 196.716 ms lesser_processing
test_buffer took 262.288 ms lesser_processing
test_part_slice took 416.072 ms medium_processing
test_copy took 352.254 ms medium_processing
test_text took 337.971 ms medium_processing
test_buffer took 438.683 ms medium_processing
test_part_slice took 502.069 ms greater_processing
test_copy took 8149.231 ms greater_processing
test_text took 8292.333 ms greater_processing
test_buffer took 563.009 ms greater_processing
Notes:
Yes I tried OP's original test_1 with [i:] slice and it's much slower, making his test even more bunk.
Interesting that buffer almost always performs slightly slower then slicing. This time there is one where it does better though! The real test though is below and buffer seems to do better for larger substrings while slicing does better for smaller substrings.
And, yes, I do have some randomness in this test so test away and see the different results :). It also may be interesting to changes the size of the 1000's.
So, maybe some others believe you, but I don't, so I'd like to know something about what processing does and how you came to the conclusion: "slicing is the problem."
I profiled medium processing in my example and upped the string.letters multiplier to 100000 and raised the length of the slices to 10000. Also below is one with slices of length 100. I used cProfile for these (much less overhead then profile!).
test_part_slice took 77338.285 ms medium_processing
31200019 function calls in 77.338 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 77.338 77.338 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 iostream.py:63(write)
5200000 8.208 0.000 43.823 0.000 re.py:139(search)
5200000 9.205 0.000 12.897 0.000 re.py:228(_compile)
5200000 5.651 0.000 49.475 0.000 slice_time.py:15(medium_processing)
1 7.901 7.901 77.338 77.338 slice_time.py:24(timeit)
5200000 19.963 0.000 69.438 0.000 slice_time.py:31(test_part_slice)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
2 0.000 0.000 0.000 0.000 {isinstance}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
5200000 3.692 0.000 3.692 0.000 {method 'get' of 'dict' objects}
5200000 22.718 0.000 22.718 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
4 0.000 0.000 0.000 0.000 {time.time}
test_buffer took 58067.440 ms medium_processing
31200103 function calls in 58.068 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 58.068 58.068 <string>:1(<module>)
3 0.000 0.000 0.000 0.000 __init__.py:185(dumps)
3 0.000 0.000 0.000 0.000 encoder.py:102(__init__)
3 0.000 0.000 0.000 0.000 encoder.py:180(encode)
3 0.000 0.000 0.000 0.000 encoder.py:206(iterencode)
1 0.000 0.000 0.001 0.001 iostream.py:37(flush)
2 0.000 0.000 0.001 0.000 iostream.py:63(write)
1 0.000 0.000 0.000 0.000 iostream.py:86(_new_buffer)
3 0.000 0.000 0.000 0.000 jsonapi.py:57(_squash_unicode)
3 0.000 0.000 0.000 0.000 jsonapi.py:69(dumps)
2 0.000 0.000 0.000 0.000 jsonutil.py:78(date_default)
1 0.000 0.000 0.000 0.000 os.py:743(urandom)
5200000 6.814 0.000 39.110 0.000 re.py:139(search)
5200000 7.853 0.000 10.878 0.000 re.py:228(_compile)
1 0.000 0.000 0.000 0.000 session.py:149(msg_header)
1 0.000 0.000 0.000 0.000 session.py:153(extract_header)
1 0.000 0.000 0.000 0.000 session.py:315(msg_id)
1 0.000 0.000 0.000 0.000 session.py:350(msg_header)
1 0.000 0.000 0.000 0.000 session.py:353(msg)
1 0.000 0.000 0.000 0.000 session.py:370(sign)
1 0.000 0.000 0.000 0.000 session.py:385(serialize)
1 0.000 0.000 0.001 0.001 session.py:437(send)
3 0.000 0.000 0.000 0.000 session.py:75(<lambda>)
5200000 4.732 0.000 43.842 0.000 slice_time.py:15(medium_processing)
1 5.423 5.423 58.068 58.068 slice_time.py:24(timeit)
5200000 8.802 0.000 52.645 0.000 slice_time.py:40(test_buffer)
7 0.000 0.000 0.000 0.000 traitlets.py:268(__get__)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 uuid.py:101(__init__)
1 0.000 0.000 0.000 0.000 uuid.py:197(__str__)
1 0.000 0.000 0.000 0.000 uuid.py:531(uuid4)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {built-in method now}
18 0.000 0.000 0.000 0.000 {isinstance}
4 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {locals}
1 0.000 0.000 0.000 0.000 {map}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {method 'count' of 'list' objects}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'extend' of 'list' objects}
5200001 3.025 0.000 3.025 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'getvalue' of '_io.StringIO' objects}
3 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
5200000 21.418 0.000 21.418 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
1 0.000 0.000 0.000 0.000 {method 'send_multipart' of 'zmq.core.socket.Socket' objects}
2 0.000 0.000 0.000 0.000 {method 'strftime' of 'datetime.date' objects}
1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {posix.close}
1 0.000 0.000 0.000 0.000 {posix.open}
1 0.000 0.000 0.000 0.000 {posix.read}
4 0.000 0.000 0.000 0.000 {time.time}
Smaller slices (100 length).
test_part_slice took 54916.153 ms medium_processing
31200019 function calls in 54.916 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 54.916 54.916 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 iostream.py:63(write)
5200000 6.788 0.000 38.312 0.000 re.py:139(search)
5200000 8.014 0.000 11.257 0.000 re.py:228(_compile)
5200000 4.722 0.000 43.034 0.000 slice_time.py:15(medium_processing)
1 5.594 5.594 54.916 54.916 slice_time.py:24(timeit)
5200000 6.288 0.000 49.322 0.000 slice_time.py:31(test_part_slice)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
2 0.000 0.000 0.000 0.000 {isinstance}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
5200000 3.242 0.000 3.242 0.000 {method 'get' of 'dict' objects}
5200000 20.268 0.000 20.268 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
4 0.000 0.000 0.000 0.000 {time.time}
test_buffer took 62019.684 ms medium_processing
31200103 function calls in 62.020 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 62.020 62.020 <string>:1(<module>)
3 0.000 0.000 0.000 0.000 __init__.py:185(dumps)
3 0.000 0.000 0.000 0.000 encoder.py:102(__init__)
3 0.000 0.000 0.000 0.000 encoder.py:180(encode)
3 0.000 0.000 0.000 0.000 encoder.py:206(iterencode)
1 0.000 0.000 0.001 0.001 iostream.py:37(flush)
2 0.000 0.000 0.001 0.000 iostream.py:63(write)
1 0.000 0.000 0.000 0.000 iostream.py:86(_new_buffer)
3 0.000 0.000 0.000 0.000 jsonapi.py:57(_squash_unicode)
3 0.000 0.000 0.000 0.000 jsonapi.py:69(dumps)
2 0.000 0.000 0.000 0.000 jsonutil.py:78(date_default)
1 0.000 0.000 0.000 0.000 os.py:743(urandom)
5200000 7.426 0.000 41.152 0.000 re.py:139(search)
5200000 8.470 0.000 11.628 0.000 re.py:228(_compile)
1 0.000 0.000 0.000 0.000 session.py:149(msg_header)
1 0.000 0.000 0.000 0.000 session.py:153(extract_header)
1 0.000 0.000 0.000 0.000 session.py:315(msg_id)
1 0.000 0.000 0.000 0.000 session.py:350(msg_header)
1 0.000 0.000 0.000 0.000 session.py:353(msg)
1 0.000 0.000 0.000 0.000 session.py:370(sign)
1 0.000 0.000 0.000 0.000 session.py:385(serialize)
1 0.000 0.000 0.001 0.001 session.py:437(send)
3 0.000 0.000 0.000 0.000 session.py:75(<lambda>)
5200000 5.399 0.000 46.551 0.000 slice_time.py:15(medium_processing)
1 5.958 5.958 62.020 62.020 slice_time.py:24(timeit)
5200000 9.510 0.000 56.061 0.000 slice_time.py:40(test_buffer)
7 0.000 0.000 0.000 0.000 traitlets.py:268(__get__)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 uuid.py:101(__init__)
1 0.000 0.000 0.000 0.000 uuid.py:197(__str__)
1 0.000 0.000 0.000 0.000 uuid.py:531(uuid4)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {built-in method now}
18 0.000 0.000 0.000 0.000 {isinstance}
4 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {locals}
1 0.000 0.000 0.000 0.000 {map}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {method 'count' of 'list' objects}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'extend' of 'list' objects}
5200001 3.158 0.000 3.158 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'getvalue' of '_io.StringIO' objects}
3 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
5200000 22.097 0.000 22.097 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
1 0.000 0.000 0.000 0.000 {method 'send_multipart' of 'zmq.core.socket.Socket' objects}
2 0.000 0.000 0.000 0.000 {method 'strftime' of 'datetime.date' objects}
1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {posix.close}
1 0.000 0.000 0.000 0.000 {posix.open}
1 0.000 0.000 0.000 0.000 {posix.read}
4 0.000 0.000 0.000 0.000 {time.time}
process(huge_text_block[i:j])
I want to avoid the overhead of generating these temporary substrings.
(...)
Note that process() is another python module
that expects a string as input.
Completely contradictory.
How can you imagine to find a way for not creating what the function requires ?!

Categories

Resources