I'm looking for speedy alternatives to my function. The goal is to build a list of 32-bit integers from integers of arbitrary length. The length is explicitly given in a tuple of (value, bitlength). This is part of a bit-banging procedure for an asynchronous interface which takes four 32-bit integers per bus transaction.
All ints are unsigned (positive or zero), and the length can vary between 0 and 2000.
My input is a list of these tuples;
the output should be integers with an implicit 32-bit length, with the bits in sequential order. The remaining bits that do not fill a full 32-bit word should also be returned.
input: [(0,128),(1,12),(0,32)]
output: [0, 0, 0, 0, 0x100000], 0, 12
(128 + 12 + 32 = 172 bits in total, i.e. five full 32-bit words plus 12 leftover bits: the 12-bit value 1 ends up left-aligned in the fifth word as 1 << 20 = 0x100000, and the trailing 12 zero bits are returned as the remainder, value 0 with length 12.)
I've spent a day or two profiling with cProfile and trying different methods, but I seem to be stuck with functions that process ~100k tuples per second, which is rather slow. Ideally I would like a 10x speedup, but I don't have enough experience to know where to start. The ultimate speed goal is to be faster than 4M tuples per second.
Thanks for any help or suggestions.
The fastest I can do is:
def foo(tuples):
    '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
    length = 0
    remlen = 0
    remint = 0
    i32list = []
    for a, b in tuples:
        n = (remint << (32-remlen)) | a  #n = (a << (remlen)) | remint
        length += b
        if length > 32:
            len32 = int(length/32)
            for i in range(len32):
                i32list.append((n >> i*32) & 0xFFFFFFFF)
            remint = n >> (len32*32)
            remlen = length - len32*32
            length = remlen
        elif length == 32:
            appint = n & 0xFFFFFFFF
            remint = 0
            remlen = 0
            length -= 32
            i32list.append(appint)
        else:
            remint = n
            remlen = length
    return i32list, remint, remlen
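As a quick sanity check, the example from the question round-trips through foo() as stated (this call is added here for illustration):

i32list, remint, remlen = foo([(0, 128), (1, 12), (0, 32)])
print([hex(x) for x in i32list], remint, remlen)
# prints: ['0x0', '0x0', '0x0', '0x0', '0x100000'] 0 12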
This has very similar performance:
def tpli_2_32ili(tuples):
    '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
    # binarylist = "".join([np.binary_repr(a, b) for a, b in inp])  # bin(a)[2:].rjust(b, '0')
    binarylist = "".join([bin(a)[2:].rjust(b, '0') for a, b in tuples])
    totallength = len(binarylist)
    tot32 = int(totallength/32)
    i32list = [int(binarylist[i:i+32],2) for i in range(0, tot32*32, 32)]
    remlen = totallength - tot32*32
    remint = int(binarylist[-remlen:],2)
    return i32list, remint, remlen
The best I could come up with so far is a 25% speed-up:
from functools import reduce

intMask = 0xffffffff

def f(x, y):
    return (x[0] << y[1]) + y[0], x[1] + y[1]

def jens(input):
    n, length = reduce(f, input, (0, 0))
    remainderBits = length % 32
    intBits = length - remainderBits
    remainder = ((n & intMask) << (32 - remainderBits)) >> (32 - remainderBits)
    n >>= remainderBits
    ints = [(n >> i) & intMask for i in range(intBits - 32, -32, -32)]  # extract the 32-bit words, most significant first
    return ints, remainderBits, remainder
print([hex(x) for x in jens([(0,128),(1,12),(0,32)])[0]])
It uses a long to sum up the tuple values according to their bit lengths, and then extracts the 32-bit values and the remaining bits from this number. The computation of the overall length (summing up the length values of the input tuples) and the computation of the large value are done in a single pass with reduce, so the iteration happens inside the built-in rather than as an explicit Python loop.
Running matineau's benchmark harness, the best numbers I have seen are:
Fastest to slowest execution speeds using Python 3.6.5
(1,000 executions, best of 3 repetitions)
jens : 0.004151 secs, rel speed 1.00x, 0.00% slower
First snippet : 0.005259 secs, rel speed 1.27x, 26.70% slower
Second snippet : 0.008328 secs, rel speed 2.01x, 100.64% slower
You could probably gain a better speed-up if you use some C extension implementing a bit array.
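Before reaching for a C extension, one intermediate step is to accumulate the bits into a single Python int, as above, and then let int.to_bytes plus struct slice it into 32-bit words at C speed. A minimal sketch (the helper name pack32 is made up; it assumes the same most-significant-first ordering as the snippets in the question):

import struct

def pack32(tuples):
    # Accumulate all bits into one big int, first tuple in the most significant position.
    n = 0
    total = 0
    for value, bits in tuples:
        n = (n << bits) | value
        total += bits
    remlen = total % 32
    remint = n & ((1 << remlen) - 1)  # trailing bits that do not fill a word
    n >>= remlen
    nwords = (total - remlen) // 32
    # Convert the big int to bytes once and let struct cut it into 32-bit words.
    words = struct.unpack('>%dI' % nwords, n.to_bytes(4 * nwords, 'big'))
    return list(words), remint, remlen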
This isn't an answer with a faster implementation. Instead it's the code from the two snippets in your question placed within an extensible benchmarking framework that makes comparing different approaches easy.
Comparing just those two testcases indicates that your second approach does not, in fact, have very similar performance to the first: based on the output shown, it's almost twice as slow.
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback

N = 1000  # Number of executions of each "algorithm".
R = 3     # Number of repetitions of those N executions.

# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
    # Import any resources needed defined in outer benchmarking script.
    #from __main__ import ???  # Not needed at this time
""")

class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
    """ A test case is composed of separate setup and test code fragments. """
    def __new__(cls, setup, test):
        """ Dedent code fragment in each string argument. """
        return tuple.__new__(cls, (dedent(setup), dedent(test)))

testcases = {
    "First snippet": TestCase("""
        def foo(tuples):
            '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
            length = 0
            remlen = 0
            remint = 0
            i32list = []
            for a, b in tuples:
                n = (remint << (32-remlen)) | a  #n = (a << (remlen)) | remint
                length += b
                if length > 32:
                    len32 = int(length/32)
                    for i in range(len32):
                        i32list.append((n >> i*32) & 0xFFFFFFFF)
                    remint = n >> (len32*32)
                    remlen = length - len32*32
                    length = remlen
                elif length == 32:
                    appint = n & 0xFFFFFFFF
                    remint = 0
                    remlen = 0
                    length -= 32
                    i32list.append(appint)
                else:
                    remint = n
                    remlen = length
            return i32list, remint, remlen
        """, """
        foo([(0,128),(1,12),(0,32)])
        """
    ),
    "Second snippet": TestCase("""
        def tpli_2_32ili(tuples):
            '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
            binarylist = "".join([bin(a)[2:].rjust(b, '0') for a, b in tuples])
            totallength = len(binarylist)
            tot32 = int(totallength/32)
            i32list = [int(binarylist[i:i+32],2) for i in range(0, tot32*32, 32)]
            remlen = totallength - tot32*32
            remint = int(binarylist[-remlen:],2)
            return i32list, remint, remlen
        """, """
        tpli_2_32ili([(0,128),(1,12),(0,32)])
        """
    ),
}

# Collect timing results of executing each testcase multiple times.
try:
    results = [
        (label,
         min(timeit.repeat(testcases[label].test,
                           setup=COMMON_SETUP + testcases[label].setup,
                           repeat=R, number=N)),
        ) for label in testcases
    ]
except Exception:
    traceback.print_exc(file=sys.stdout)  # direct output to stdout
    sys.exit(1)

# Display results.
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
      '({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()

longest = max(len(result[0]) for result in results)  # length of longest label
ranked = sorted(results, key=lambda t: t[1])         # ascending sort by execution time
fastest = ranked[0][1]

for result in ranked:
    print('{:>{width}} : {:9.6f} secs, rel speed {:5,.2f}x, {:8,.2f}% slower '
          ''.format(
              result[0], result[1], round(result[1]/fastest, 2),
              round((result[1]/fastest - 1) * 100, 2),
              width=longest))
Output:
Fastest to slowest execution speeds using Python 3.6.5
(1,000 executions, best of 3 repetitions)
First snippet : 0.003024 secs, rel speed 1.00x, 0.00% slower
Second snippet : 0.005085 secs, rel speed 1.68x, 68.13% slower
Related
Consider we have 2 arrays of size N, with their values in the range [0, N-1]. For example:
a = np.array([0, 1, 2, 0])
b = np.array([2, 0, 3, 3])
I need to produce a new array c which contains exactly N/2 elements from a and b respectively, i.e. the values must be taken evenly/equally from both parent arrays.
(For odd length, this would be (N-1)/2 and (N+1)/2. Can also ignore odd length case, not important).
Taking equal number of elements from two arrays is pretty trivial, but there is an additional constraint: c should have as many unique numbers as possible / as few duplicates as possible.
For example, a solution to a and b above is:
c = np.array([b[0], a[1], b[2], a[3]])
>>> c
array([2, 1, 3, 0])
Note that the position/order is preserved. Each element of a and b that we took to form c stays in the same position: if element i of c comes from a, then c[i] == a[i], and likewise for b.
A straightforward solution for this is simply a sort of path traversal, easy enough to implement recursively:
def traverse(i, a, b, path, n_a, n_b, best, best_path):
    if n_a == 0 and n_b == 0:
        score = len(set(path))
        return (score, path.copy()) if score > best else (best, best_path)
    if n_a > 0:
        path.append(a[i])
        best, best_path = traverse(i + 1, a, b, path, n_a - 1, n_b, best, best_path)
        path.pop()
    if n_b > 0:
        path.append(b[i])
        best, best_path = traverse(i + 1, a, b, path, n_a, n_b - 1, best, best_path)
        path.pop()
    return best, best_path
Here n_a and n_b are how many values we will take from a and b respectively, it's 2 and 2 as we want to evenly take 4 items.
>>> score, best_path = traverse(0, a, b, [], 2, 2, 0, None)
>>> score, best_path
(4, [2, 1, 3, 0])
Is there a way to implement the above in a more vectorized/efficient manner, possibly through numpy?
The algorithm is slow mainly because it runs in exponential time. There is no straightforward way to vectorize it using only Numpy because of the recursion. Even if there were, the huge number of combinations would make most Numpy implementations inefficient (due to the large temporary arrays involved). Additionally, there is AFAIK no vectorized operation to count the number of unique values per row efficiently (the usual tool, np.unique, is not efficient in this case and cannot be used without a loop). As a result, there are two possible strategies to speed this up:
trying to find an algorithm with a reasonable complexity (e.g. <= O(n^4));
using compilation methods, micro-optimizations and tricks to write a faster brute-force implementation.
Since finding a correct sub-exponential algorithm turns out not to be easy, I chose the second approach (though the first would be the better one).
The idea is to:
remove the recursion by generating all possible solutions with a loop iterating over integers (each integer encodes one candidate path as a bitmask);
write a fast way to count the unique items of an array;
use the Numba JIT compiler to optimize code that is only fast once compiled.
Here is the final code:
import numpy as np
import numba as nb

# Naive way to count unique items.
# This is a slow fallback implementation.
@nb.njit
def naive_count_unique(arr):
    count = 0
    for i in range(len(arr)):
        val = arr[i]
        found = False
        for j in range(i):
            if arr[j] == val:
                found = True
                break
        if not found:
            count += 1
    return count

# Optimized way to count unique items on small arrays.
# Count items 2 by 2.
# Fast on small arrays.
@nb.njit
def optim_count_unique(arr):
    count = 0
    for i in range(0, len(arr), 2):
        if arr[i] == arr[i+1]:
            tmp = 1
            for j in range(i):
                if arr[j] == arr[i]: tmp = 0
            count += tmp
        else:
            val1, val2 = arr[i], arr[i+1]
            tmp1, tmp2 = 1, 1
            for j in range(i):
                val = arr[j]
                if val == val1: tmp1 = 0
                if val == val2: tmp2 = 0
            count += tmp1 + tmp2
    return count

@nb.njit
def count_unique(arr):
    if len(arr) % 2 == 0:
        return optim_count_unique(arr)
    else:
        # Odd case: not optimized yet
        return naive_count_unique(arr)

# Count the number of bits in a 32-bit integer
# See https://stackoverflow.com/questions/71097470/msb-lsb-popcount-in-numba
@nb.njit('int_(uint32)', inline='always')
def popcount(v):
    v = v - ((v >> 1) & 0x55555555)
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
    c = np.uint32((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24
    return c

# Count the number of bits in a 64-bit integer
@nb.njit(inline='always')
def bit_count(n):
    if n < (1 << 30):
        return popcount(np.uint32(n))
    else:
        return popcount(np.uint32(n)) + popcount(np.uint32(n >> 32))

# Mutate `out` so not to create an expensive new temporary array
@nb.njit
def int_to_path(n, out, a, b):
    for i in range(len(out)):
        out[i] = a[i] if ((n >> i) & 1) else b[i]

@nb.njit(['(int32[:], int32[:], int64, int64)', '(int64[:], int64[:], int64, int64)'])
def traverse_fast(a, b, n_a, n_b):
    # This assertion is needed because the paths are encoded using 64-bit.
    # This should not be a problem in practice since the number of solutions to
    # test would be impracticably huge to test using this algorithm anyway.
    assert n_a + n_b < 62
    max_iter = 1 << (n_a + n_b)
    path = np.empty(n_a + n_b, dtype=a.dtype)
    score, best_score, best_i = 0, 0, 0
    # Iterate over all cases (more than the set of possible solution)
    for i in range(max_iter):
        # Filter the possible solutions
        if bit_count(i) != n_b:
            continue
        # Analyse the score of the solution
        int_to_path(i, path, a, b)
        score = count_unique(path)
        # Store it if it better than the previous one
        if score > best_score:
            best_score = score
            best_i = i
    int_to_path(best_i, path, a, b)
    return best_score, path
This implementation is about 30 times faster on arrays of size 8 on my machine. One could use several cores to speed this up even further, but I think it is better to focus on finding a sub-exponential algorithm so as to avoid wasting more computing resources. Note that the returned path differs from the one found by the initial function, but the score is the same on random arrays. This should help others test their implementations on larger arrays without waiting a long time.
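For reference, a short usage sketch of traverse_fast (added for illustration; the dtypes are chosen to match the Numba signatures above, and the returned path may legitimately differ from the recursive version, as noted):

a = np.array([0, 1, 2, 0], dtype=np.int64)
b = np.array([2, 0, 3, 3], dtype=np.int64)
best_score, best_path = traverse_fast(a, b, 2, 2)
print(best_score, best_path)  # score 4 and one of the paths with 4 unique values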
Test this heavily.
import numpy as np
from numpy.random import default_rng

rand = default_rng(seed=1)
n = 16
a = rand.integers(low=0, high=n, size=n)
b = rand.integers(low=0, high=n, size=n)
uniques = np.setxor1d(a, b)
print(a)
print(b)
print(uniques)

def limited_uniques(arr: np.ndarray) -> np.ndarray:
    choose = np.zeros(shape=n, dtype=bool)
    _, idx, _ = np.intersect1d(arr, uniques, return_indices=True)
    idx = idx[:n//2]
    choose[idx] = True
    n_missing = n//2 - len(idx)
    counts = choose.cumsum()
    diffs = np.arange(n) - counts
    at = np.searchsorted(diffs, n_missing)
    choose[:at] = True
    return arr[choose]

a_half = limited_uniques(a)
uniques = np.union1d(uniques, np.setdiff1d(a, a_half))

interleaved = np.empty_like(a)
interleaved[0::2] = a_half
interleaved[1::2] = limited_uniques(b)
print(interleaved)
[ 7 8 12 15 0 2 13 15 3 4 13 6 4 13 4 6]
[10 8 1 0 13 12 13 8 13 5 7 12 1 4 1 7]
[ 1 2 3 5 6 10 15]
[ 7 10 8 8 12 1 15 0 0 13 2 12 3 5 6 4]
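A quick check (added for illustration, reusing the arrays defined above) to quantify how well the heuristic did:

print(len(np.unique(interleaved)), 'unique values out of', n)
# for the run above: 13 unique values out of 16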
I have implemented a Mersenne Twister in Python using the following example code and, while it does work as intended, I am not clear on how to limit the results returned to a range of integers. For example, if I wanted to use this mechanism to determine the value of a dice roll, I would (right now, in an incredibly inefficient manner) iterate through potential results from the MT until something falls within the set. There has to be a much more efficient way to get a value to fall within a given range, like 1-20.
Currently, the algorithm returns random numbers that don't land anywhere close to the desired range:
2840889030
2341262508
2626522481
893458501
1134227444
3424236607
4171927007
1414775506
318984778
811882651
1509520423
1796453323
571461449
2606098999
2100002233
202969379
2318195635
1583585513
863717092
1218132929
1044954980
2997947229
867650808
177016714
2532350044
2917724494
2789913671
2793703767
1477382755
2552234519
2230774266
956596469
1165204853
1261233074
1856099289
21274564
1867584221
200970721
2112891842
139474834
93227265
1919721548
1026587194
30693196
3114464709
2194502660
2235520335
1877205724
1093736467
3136329929
1838505684
1358237877
2394536120
1268347552
1222927042
2982839076
1155599683
1943346953
3778719619
1483759762
3227630028
2775862513
2991889829
4252811853
995611629
626323532
3895812866
4027023347
3778533921
3840271846
4289281429
2263887842
402963991
2957069652
238880521
3643974307
472466724
3309455978
3588191581
1390613042
290666747
1375502175
1172854301
2159248842
3279978887
2206149102
804187781
3811948116
4134597627
1556281173
2590972812
3291094915
1836658937
3721612785
365099684
3884686172
2966532828
3609464378
1672431128
3959413372
For testing purposes I implemented the test logic below (as part of __main__), and so far it just runs indefinitely.
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Based on the pseudocode in https://en.wikipedia.org/wiki/Mersenne_Twister.
Generates uniformly distributed 32-bit integers in the range [0, 2**32 - 1]
with the MT19937 algorithm.

Yaşar Arabacı <yasar11732 et gmail nokta com>
"""

# Create a length 624 list to store the state of the generator
MT = [0 for i in xrange(624)]
index = 0

# To get last 32 bits
bitmask_1 = (2 ** 32) - 1

# To get 32. bit
bitmask_2 = 2 ** 31

# To get last 31 bits
bitmask_3 = (2 ** 31) - 1

def initialize_generator(seed):
    "Initialize the generator from a seed"
    global MT
    global bitmask_1
    MT[0] = seed
    for i in xrange(1, 624):
        MT[i] = ((1812433253 * MT[i-1]) ^ ((MT[i-1] >> 30) + i)) & bitmask_1

def extract_number():
    """
    Extract a tempered pseudorandom number based on the index-th value,
    calling generate_numbers() every 624 numbers
    """
    global index
    global MT
    if index == 0:
        generate_numbers()
    y = MT[index]
    y ^= y >> 11
    y ^= (y << 7) & 2636928640
    y ^= (y << 15) & 4022730752
    y ^= y >> 18
    index = (index + 1) % 624
    return y

def generate_numbers():
    "Generate an array of 624 untempered numbers"
    global MT
    for i in xrange(624):
        y = (MT[i] & bitmask_2) + (MT[(i + 1) % 624] & bitmask_3)
        MT[i] = MT[(i + 397) % 624] ^ (y >> 1)
        if y % 2 != 0:
            MT[i] ^= 2567483615

if __name__ == "__main__":
    from datetime import datetime
    now = datetime.now()
    solved = False
    initialize_generator(now.microsecond)
    #for i in xrange(10):
    #    "Print 10 random numbers as an example"
    while(solved != True):
        generated_number = extract_number()
        while(generated_number <= 20 and generated_number >= 1):
            print generated_number
            solved = True
Any advice on how to implement this? It appears the loop may never even get a chance to land on a number within the predefined range.
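The usual fix is to map the full 32-bit output down to the desired range rather than waiting for it to land there by chance. A minimal sketch, assuming the extract_number() defined above (the helper names roll_simple and roll_unbiased are made up for illustration):

# Simple: take the output modulo the number of sides.
# Slightly biased when the range size does not divide 2**32 evenly.
def roll_simple(sides=20):
    return 1 + extract_number() % sides

# Unbiased: rejection sampling; discard the top sliver of the 32-bit range
# so every residue class modulo `sides` is equally likely.
def roll_unbiased(sides=20):
    limit = (2 ** 32 // sides) * sides  # largest multiple of `sides` <= 2**32
    r = extract_number()
    while r >= limit:
        r = extract_number()
    return 1 + r % sides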
I am trying to find the fastest way of xor'ing all integers (numerals, actually) in a string consecutively. The problem makes me feel that there is a simple and fast answer I just can't think of. But here is what I came up with so far.
Setup:
from operator import xor
a = "123"
Regular loop
val = 0
for i in a:
    val = val ^ int(i)
print val
operator.xor with reduce
reduce(xor, map(int, list(a)))
I expected the second one to be faster, but as the string grows the difference is almost none. Is there a faster way?
Note 1: I would also like to know whether it is possible to use just the integer 123 instead of the string "123". That may sound illogical, because I need the individual digits as integers, but sometimes interesting answers appear from places you never expect.
Edit: Here is the results from the methods suggested so far.
import timeit
setup = """
from operator import xor
a = "124"
b = 124
"""
p1 = """
val = 0
for i in a:
val = val ^ int(i)
val
"""
p2 = """
reduce(xor, map(int, list(a)))
"""
p3 = """
val = 0
for i in xrange(3):
val = val ^ (b % 10)
b /= 10
val
"""
p4 = """
15 & reduce(xor, map(ord, a))
"""
print 1, timeit.timeit(p1, setup=setup, number = 100000)
print 2, timeit.timeit(p2, setup=setup, number = 100000)
print 3, timeit.timeit(p3, setup=setup, number = 100000)
print 4, timeit.timeit(p4, setup=setup, number = 100000)
# Gives
1 0.251768243842
2 0.377706036384
3 0.0885620849347
4 0.140079894386
Please also note that using int(a) instead of b in process 3 makes it slower than 4.
On my (Python 3) system, this rework of the solution runs measurably faster than those shown:
from operator import xor
from functools import reduce
print(15 & reduce(xor, map(ord, a)))
If we know they are all digits, 15 & ord('5') pulls out the bits we need with less overhead than int('5'). And we can defer the masking 'and', doing it just once at the end.
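Concretely, the ASCII codes for '0' through '9' are 0x30 through 0x39, so the low four bits are exactly the digit's value:

>>> hex(ord('5'))
'0x35'
>>> ord('5') & 15
5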
To use a number instead of a string, you can do:
b = 31415926535897932384626433832795028841971693993751058209749445923078164
val = 0
while b:
    b, modulo = divmod(b, 10)
    val ^= modulo
print(val)
Lemme clarify:
What would be the fastest way to get every number with all unique digits between two numbers. For example, 10,000 and 100,000.
Some obvious ones would be 12,345 or 23,456. I'm trying to find a way to gather all of them.
for i in xrange(LOW, HIGH):
    str_i = str(i)
    ...?
Use itertools.permutations:
from itertools import permutations
result = [
    a * 10000 + b * 1000 + c * 100 + d * 10 + e
    for a, b, c, d, e in permutations(range(10), 5)
    if a != 0
]
I used the following facts:
numbers between 10000 and 100000 have either 5 or 6 digits, and the only 6-digit number in that range (100000) does not have unique digits;
itertools.permutations creates all arrangements of the given length, in every ordering (so both 12345 and 54321 will appear in the result);
you can run permutations directly on a sequence of integers (so there is no overhead for converting types).
EDIT:
Thanks for accepting my answer, but here is the data for the others, comparing the solutions mentioned:
>>> from timeit import timeit
>>> stmt1 = '''
a = []
for i in xrange(10000, 100000):
    s = str(i)
    if len(set(s)) == len(s):
        a.append(s)
'''
>>> stmt2 = '''
result = [
    int(''.join(digits))
    for digits in permutations('0123456789', 5)
    if digits[0] != '0'
]
'''
>>> setup2 = 'from itertools import permutations'
>>> stmt3 = '''
result = [
    x for x in xrange(10000, 100000)
    if len(set(str(x))) == len(str(x))
]
'''
>>> stmt4 = '''
result = [
    a * 10000 + b * 1000 + c * 100 + d * 10 + e
    for a, b, c, d, e in permutations(range(10), 5)
    if a != 0
]
'''
>>> setup4 = setup2
>>> timeit(stmt1, number=100)
7.955858945846558
>>> timeit(stmt2, setup2, number=100)
1.879319190979004
>>> timeit(stmt3, number=100)
8.599710941314697
>>> timeit(stmt4, setup4, number=100)
0.7493319511413574
So, to sum up:
solution no. 1 took 7.96 s,
solution no. 2 (my original solution) took 1.88 s,
solution no. 3 took 8.6 s,
solution no. 4 (my updated solution) took 0.75 s,
The last solution is around 10x faster than the solutions proposed by others.
Note: My solution has some imports that I did not measure. I assumed your imports will happen once, and code will be executed multiple times. If it is not the case, please adapt the tests to your needs.
EDIT #2: I have added another solution, as operating on strings is not even necessary; it can be achieved with permutations of real integers. I bet this can be sped up even more.
Cheap way to do this:
for i in xrange(LOW, HIGH):
    s = str(i)
    if len(set(s)) == len(s):
        # number has unique digits
This uses a set to collect the unique digits, then checks to see that there are as many unique digits as digits in total.
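For example, a repeated digit shrinks the set:

>>> len(set("12334")), len("12334")
(4, 5)
>>> len(set("12345")), len("12345")
(5, 5)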
List comprehension will work a treat here (logic stolen from nneonneo):
[x for x in xrange(LOW,HIGH) if len(set(str(x)))==len(str(x))]
And a timeit for those who are curious:
> python -m timeit '[x for x in xrange(10000,100000) if len(set(str(x)))==len(str(x))]'
10 loops, best of 3: 101 msec per loop
Here is an answer from scratch:
def permute(L, max_len):
    allowed = L[:]
    results, seq = [], range(max_len)
    def helper(d):
        if d==0:
            results.append(''.join(seq))
        else:
            for i in xrange(len(L)):
                if allowed[i]:
                    allowed[i]=False
                    seq[d-1]=L[i]
                    helper(d-1)
                    allowed[i]=True
    helper(max_len)
    return results

A = permute(list("1234567890"), 5)
print A
print len(A)
print all(map(lambda a: len(set(a))==len(a), A))
It could perhaps be further optimized by using an interval representation of the allowed elements, although for n=10 I'm not sure it would make a difference. I could also transform the recursion into a loop, but in this form it is more elegant and clear.
Edit: Here are the timings of the various solutions
2.75808000565 (My solution)
8.22729802132 (Sol 1)
1.97218298912 (Sol 2)
9.659760952 (Sol 3)
0.841020822525 (Sol 4)
no_list = ['115432', '555555', '1234567', '5467899', '3456789', '987654', '444444']
rep_list = []
nonrep_list = []
for no in no_list:
    u = []
    for digit in no:
        # print(digit)
        if digit not in u:
            u.append(digit)
    # print(u)
    # If a repeated digit is there
    if len(no) != len(u):
        # print(no)
        rep_list.append(no)
    # If no repetition is there
    else:
        nonrep_list.append(no)
print('Numbers which have repetition are =', rep_list)
print('Numbers which have no repetition are =', nonrep_list)
I rewrote the original radix sort algorithm for Python from Wikipedia, using arrays from SciPy to gain performance and reduce code length, which I managed to accomplish. Then I took the classic (in-memory, pivot-based) quick sort algorithm from Literate Programming and compared their performance.
I expected radix sort to beat quick sort beyond a certain threshold, which it did not. Further, I found Erik Gorset's blog asking the question "Is radix sort faster than quick sort for integer arrays?". There the answer is that
.. the benchmark shows the MSB in-place radix sort to be consistently over 3 times faster than quicksort for large arrays.
Unfortunately, I could not reproduce the result; the differences are that (a) Erik chose Java and not Python and (b) he uses an MSB in-place radix sort, whereas I just fill buckets inside a Python dictionary.
According to theory, radix sort should be faster (linear) than quick sort; but apparently it depends a lot on the implementation. So where is my mistake?
Here is the code comparing both algorithms:
from sys import argv
from time import clock

from pylab import array, vectorize
from pylab import absolute, log10, randint
from pylab import semilogy, grid, legend, title, show

###############################################################################
# radix sort
###############################################################################

def splitmerge0 (ls, digit): ## python (pure!)
    seq = map (lambda n: ((n // 10 ** digit) % 10, n), ls)
    buf = {0:[], 1:[], 2:[], 3:[], 4:[], 5:[], 6:[], 7:[], 8:[], 9:[]}
    return reduce (lambda acc, key: acc.extend(buf[key]) or acc,
        reduce (lambda _, (d,n): buf[d].append (n) or buf, seq, buf), [])

def splitmergeX (ls, digit): ## python & numpy
    seq = array (vectorize (lambda n: ((n // 10 ** digit) % 10, n)) (ls)).T
    buf = {0:[], 1:[], 2:[], 3:[], 4:[], 5:[], 6:[], 7:[], 8:[], 9:[]}
    return array (reduce (lambda acc, key: acc.extend(buf[key]) or acc,
        reduce (lambda _, (d,n): buf[d].append (n) or buf, seq, buf), []))

def radixsort (ls, fn = splitmergeX):
    return reduce (fn, xrange (int (log10 (absolute (ls).max ()) + 1)), ls)

###############################################################################
# quick sort
###############################################################################

def partition (ls, start, end, pivot_index):
    lower = start
    upper = end - 1
    pivot = ls[pivot_index]
    ls[pivot_index] = ls[end]
    while True:
        while lower <= upper and ls[lower] < pivot: lower += 1
        while lower <= upper and ls[upper] >= pivot: upper -= 1
        if lower > upper: break
        ls[lower], ls[upper] = ls[upper], ls[lower]
    ls[end] = ls[lower]
    ls[lower] = pivot
    return lower

def qsort_range (ls, start, end):
    if end - start + 1 < 32:
        insertion_sort(ls, start, end)
    else:
        pivot_index = partition (ls, start, end, randint (start, end))
        qsort_range (ls, start, pivot_index - 1)
        qsort_range (ls, pivot_index + 1, end)
    return ls

def insertion_sort (ls, start, end):
    for idx in xrange (start, end + 1):
        el = ls[idx]
        for jdx in reversed (xrange(0, idx)):
            if ls[jdx] <= el:
                ls[jdx + 1] = el
                break
            ls[jdx + 1] = ls[jdx]
        else:
            ls[0] = el
    return ls

def quicksort (ls):
    return qsort_range (ls, 0, len (ls) - 1)

###############################################################################
if __name__ == "__main__":
###############################################################################

    lower = int (argv [1]) ## requires: >= 2
    upper = int (argv [2]) ## requires: >= 2

    color = dict (enumerate (3*['r','g','b','c','m','k']))
    rslbl = "radix sort"
    qslbl = "quick sort"

    for value in xrange (lower, upper):
        #######################################################################
        ls = randint (1, value, size=value)
        t0 = clock ()
        rs = radixsort (ls)
        dt = clock () - t0
        print "%06d -- t0:%0.6e, dt:%0.6e" % (value, t0, dt)
        semilogy (value, dt, '%s.' % color[int (log10 (value))], label=rslbl)
        #######################################################################
        ls = randint (1, value, size=value)
        t0 = clock ()
        rs = quicksort (ls)
        dt = clock () - t0
        print "%06d -- t0:%0.6e, dt:%0.6e" % (value, t0, dt)
        semilogy (value, dt, '%sx' % color[int (log10 (value))], label=qslbl)

    grid ()
    legend ((rslbl,qslbl), numpoints=3, shadow=True, prop={'size':'small'})
    title ('radix & quick sort: #(integer) vs duration [s]')
    show ()
###############################################################################
###############################################################################
And here is the result comparing sorting durations in seconds (logarithmic vertical axis) for integer arrays of sizes ranging from 2 to 1250 (horizontal axis); the lower curve belongs to quick sort:
(figure: Radix vs Quick Sort Comparison)
Quick sort is smooth across the power-of-ten boundaries (e.g. at 10, 100 or 1000), while radix sort jumps a little at those points but otherwise follows qualitatively the same path as quick sort, just much slower!
You have several problems here.
First of all, as pointed out in the comments, your data set is far too small for the theoretical complexity to overcome the overheads in the code.
Next your implementation with all those unnecessary function calls and copying lists around is very inefficient. Writing the code in a straightforward procedural manner will almost always be faster than a functional solution (for Python that is, other languages will differ here). You have a procedural implementation of quicksort so if you write your radix sort in the same style it may turn out faster even for small lists.
Finally, it may be that when you do try large lists the overheads of memory management begin to dominate. That means that you have a limited window between small lists where the efficiency of the implementation is the dominant factor and large lists where the memory management is the dominant factor.
Here's some code that keeps your quicksort but uses a simple radix sort written procedurally, trying to avoid so much copying of data. You'll see that even for short lists it beats the quicksort; more interestingly, as the data size goes up, so does the ratio between quicksort and radix sort, and then it begins to drop again as memory management starts to dominate (simple things like freeing a list of 1,000,000 items take a significant amount of time):
from random import randint
from math import log10
from time import clock
from itertools import chain

def splitmerge0 (ls, digit): ## python (pure!)
    seq = map (lambda n: ((n // 10 ** digit) % 10, n), ls)
    buf = {0:[], 1:[], 2:[], 3:[], 4:[], 5:[], 6:[], 7:[], 8:[], 9:[]}
    return reduce (lambda acc, key: acc.extend(buf[key]) or acc,
        reduce (lambda _, (d,n): buf[d].append (n) or buf, seq, buf), [])

def splitmerge1 (ls, digit): ## python (readable!)
    buf = [[] for i in range(10)]
    divisor = 10 ** digit
    for n in ls:
        buf[(n//divisor)%10].append(n)
    return chain(*buf)

def radixsort (ls, fn = splitmerge1):
    return list(reduce (fn, xrange (int (log10 (max(abs(val) for val in ls)) + 1)), ls))

###############################################################################
# quick sort
###############################################################################

def partition (ls, start, end, pivot_index):
    lower = start
    upper = end - 1
    pivot = ls[pivot_index]
    ls[pivot_index] = ls[end]
    while True:
        while lower <= upper and ls[lower] < pivot: lower += 1
        while lower <= upper and ls[upper] >= pivot: upper -= 1
        if lower > upper: break
        ls[lower], ls[upper] = ls[upper], ls[lower]
    ls[end] = ls[lower]
    ls[lower] = pivot
    return lower

def qsort_range (ls, start, end):
    if end - start + 1 < 32:
        insertion_sort(ls, start, end)
    else:
        pivot_index = partition (ls, start, end, randint (start, end))
        qsort_range (ls, start, pivot_index - 1)
        qsort_range (ls, pivot_index + 1, end)
    return ls

def insertion_sort (ls, start, end):
    for idx in xrange (start, end + 1):
        el = ls[idx]
        for jdx in reversed (xrange(0, idx)):
            if ls[jdx] <= el:
                ls[jdx + 1] = el
                break
            ls[jdx + 1] = ls[jdx]
        else:
            ls[0] = el
    return ls

def quicksort (ls):
    return qsort_range (ls, 0, len (ls) - 1)

if __name__=='__main__':
    for value in 1000, 10000, 100000, 1000000, 10000000:
        ls = [randint (1, value) for _ in range(value)]
        ls2 = list(ls)
        last = -1
        start = clock()
        ls = radixsort(ls)
        end = clock()
        for i in ls:
            assert last <= i
            last = i
        print("rs %d: %0.2fs" % (value, end-start))
        tdiff = end-start
        start = clock()
        ls2 = quicksort(ls2)
        end = clock()
        last = -1
        for i in ls2:
            assert last <= i
            last = i
        print("qs %d: %0.2fs %0.2f%%" % (value, end-start, ((end-start)/tdiff*100)))
The output when I run this is:
C:\temp>c:\python27\python radixsort.py
rs 1000: 0.00s
qs 1000: 0.00s 212.98%
rs 10000: 0.02s
qs 10000: 0.05s 291.28%
rs 100000: 0.19s
qs 100000: 0.58s 311.98%
rs 1000000: 2.47s
qs 1000000: 7.07s 286.33%
rs 10000000: 31.74s
qs 10000000: 86.04s 271.08%
Edit:
Just to clarify: the quicksort implementation here is very memory friendly; it sorts in place, so no matter how large the list, it is just shuffling data around, not copying it. The original radix sort effectively copies the list twice for each digit: once into the smaller bucket lists and then again when you concatenate those lists. Using itertools.chain avoids that second copy, but there's still a lot of memory allocation/deallocation going on. (Also, 'twice' is approximate, since list appending does involve extra copying even if it is amortized O(1), so I should maybe say 'proportional to twice'.)
Your data representation is very expensive. Why do you use a hashmap for your buckets? Why use a base-10 representation, for which you need to compute logarithms (expensive to compute)?
Avoid lambda expressions and such; I don't think Python can optimize them very well yet.
Maybe start by sorting 10-byte strings for the benchmark instead. And: no hashmaps and similar expensive data structures.
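To illustrate that advice with a sketch (not the answerer's code; the name radixsort256 is made up): list-based buckets and base-256 digits extracted with shifts, so the inner loop involves no dictionary lookups, logarithms or lambdas:

def radixsort256(ls):
    # LSD radix sort for non-negative integers, 8 bits (one byte) per pass.
    if not ls:
        return ls
    max_val = max(ls)
    shift = 0
    while (max_val >> shift) > 0:
        buckets = [[] for _ in range(256)]  # plain lists, no hashmap
        for n in ls:
            buckets[(n >> shift) & 0xFF].append(n)
        ls = [n for bucket in buckets for n in bucket]
        shift += 8
    return ls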