Entropy Estimator based on the Lempel-Ziv algorithm using Python

This function estimates the entropy of a time series. It is based on the Lempel-Ziv compression algorithm. For a time series of length n, the entropy is estimated as:
E = ( (1/n) * SUM_i L_i )^(-1) * ln(n)
where L_i is the length of the shortest substring starting at position i that does not previously appear in positions 1 to i-1. The estimated entropy converges to the real entropy of the time series as n approaches infinity.
There is already a MATLAB implementation:
https://cn.mathworks.com/matlabcentral/fileexchange/51042-entropy-estimator-based-on-the-lempel-ziv-algorithm?s_tid=prof_contriblnk
I would like to implement it in Python, and I did it like this:
import math

def contains(small, big):
    for i in range(len(big)-len(small)+1):
        if big[i:i+len(small)] == small:
            return True
    return False

def actual_entropy(l):
    n = len(l)
    sequence = [l[0]]
    sum_gamma = 0

    for i in range(1, n):
        for j in range(i+1, n+1):
            s = l[i:j]
            if contains(s, sequence) != True:
                sum_gamma += len(s)
                sequence.append(l[i])
                break

    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae
However, I found it calculates too slowly as the data size gets bigger. For example, I used a list of 23832 elements as input, and the time consumed looks like this (data can be found here):
0-1000: 1.7068431377410889 s
1000-2000: 18.561192989349365 s
2000-3000: 84.82257103919983 s
3000-4000: 243.5819959640503 s
...
I have thousands of lists like this to calculate, and such a long runtime is unbearable. How should I optimize this function and make it work faster?

I played around a bit and tried a few different approaches from another thread on StackOverflow. And this is the code I came up with:
import math

def contains(small, big):
    # big and small are NumPy arrays; searching their raw bytes is much faster
    # than comparing slices element by element. (.tostring() is the older name
    # for what newer NumPy releases call .tobytes().)
    try:
        big.tostring().index(small.tostring())//big.itemsize
        return True
    except ValueError:
        return False

def actual_entropy(l):
    # l is expected to be a NumPy array so that its slices support .tostring()
    n = len(l)
    sum_gamma = 0

    for i in range(1, n):
        sequence = l[:i]
        for j in range(i+1, n+1):
            s = l[i:j]
            if contains(s, sequence) != True:
                sum_gamma += len(s)
                break

    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae
Funnily enough, casting the NumPy arrays to strings and searching those is faster than comparing the arrays directly. A very crude benchmark of my code on my machine with the data you provide is:
N: my code - your code
1000: 0.039s - 1.039s
2000: 0.266s - 18.490s
3000: 0.979s - 74.761s
4000: 2.891s - 285.488s
You may be able to make this even faster if you parallelize the outer loop.
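For what it's worth, here is a rough sketch of that parallelization (mine, not benchmarked; the helper name and the pool size are my own choices, and it reuses the contains() defined above, so l should again be a NumPy array):
import math
from multiprocessing import Pool

def shortest_new_substring_length(args):
    # One term of sum_gamma: length of the shortest substring starting at i
    # that does not occur anywhere in l[:i].
    l, i = args
    sequence = l[:i]
    for j in range(i + 1, len(l) + 1):
        if contains(l[i:j], sequence) != True:
            return j - i
    return 0  # every substring starting at i already appeared earlier

def actual_entropy_parallel(l, processes=4):
    n = len(l)
    with Pool(processes) as pool:  # on Windows, guard with `if __name__ == "__main__":`
        sum_gamma = sum(pool.map(shortest_new_substring_length,
                                 [(l, i) for i in range(1, n)]))
    return 1 / (sum_gamma / n) * math.log(n)
Each index i is independent, so splitting the outer loop across processes is straightforward; whether it pays off depends on how large n is compared to the process start-up and pickling cost.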


How to speed up search for abundant numbers?

Is there a way this code could be improved so that it runs faster? Currently, this task takes between 11 and 12 seconds to run in my virtual environment.
def divisors(n):
    return sum([x for x in range(1, (round(n/2))) if n % x == 0])

def abundant_numbers():
    return [x for x in range(1, 28123) if x < divisors(x)]

result = abundant_numbers()
Whenever you look to speed something up, you should first check whether the algorithm itself should change. And in this case it should.
Instead of looking for the divisors of a given number, look for the numbers that each divisor divides. For the latter you can use a sieve-like approach. That leads to this algorithm:
def abundant_numbers(n):
    # All numbers are strict multiples of 1, except 0 and 1
    divsums = [1] * n
    for div in range(2, n//2 + 1):  # Corrected end-of-range
        for i in range(2*div, n, div):
            divsums[i] += div  # Sum up divisors for number i
    divsums[0] = 0  # Make sure that 0 is not counted
    return [i for i, divsum in enumerate(divsums) if divsum > i]

result = abundant_numbers(28123)
This runs quite fast, many times faster than a translation of your algorithm to numpy.
Note that you had a bug in your code: round(n/2) as the range end can miss a divisor. It should be n//2 + 1.
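To make that concrete (a toy check of my own):
def divisors_buggy(n):
    return sum(x for x in range(1, round(n / 2)) if n % x == 0)   # original range end

def divisors_fixed(n):
    return sum(x for x in range(1, n // 2 + 1) if n % x == 0)     # corrected range end

print(divisors_buggy(4))  # 1 -- range(1, 2) stops before the proper divisor 2
print(divisors_fixed(4))  # 3 -- 1 + 2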

Severe efficiency problems with generating combinations

I'm trying to solve this question:
Given a positive integral number n, return a strictly increasing
sequence (list/array/string depending on the language) of numbers, so
that the sum of the squares is equal to n².
If there are multiple solutions (and there will be), return the result
with the largest possible value:
Basically, a squared number deconstructed into smaller squares. However, my code only works efficiently for small numbers (roughly 20 at most for the first piece of code, around 30 for the second) and gets exponentially slower beyond that. How can I improve the code?
I don't know enough about efficiency to apply it to my code. I tried optimising the first version by reducing memory usage, having comb drop any irrelevant data before the step below.
For the second piece of code I tried using recursion to solve my problem, but it still gets exponentially slower. Perhaps I need a new method? Much appreciated.
import itertools as it
from math import sqrt

def decompose(n):
    squares = [i ** 2 for i in range(1, n) if (i ** 2)/2 < n ** 2]
    comb = [list(i) for i in (reduce(lambda acc, x: acc + list(it.combinations(squares, x)),
                                     range(1, len(squares) + 1), [])) if sum(i) == n ** 2]
    print [int(sqrt(i)) for i in max(comb)]

decompose(20)
This was my first attempt; it was too inefficient, so I tried this:
from math import sqrt

stuff = []

def decompose(a):
    def subset_sum(numbers, target, partial=[]):
        s = sum(partial)
        if s == target:
            stuff.append(partial)
        if s >= target:
            return
        for i in range(len(numbers)):
            n = numbers[i]
            remaining = numbers[i+1:]
            subset_sum(remaining, target, partial + [n])

    compare = 1
    large = None
    subset_sum([x**2 for x in range(1, a)], a**2)
    for y in stuff:
        if compare < y[-1]:
            compare = y[-1]
            large = y
    print [int(sqrt(o)) for o in large]

decompose(30)

Converting a function with two recursive calls into an iterative function

I've got a function that has two recursive calls and I'm trying to convert it to an iterative function. I've got it figured out where I can do it with one call fairly easily, but I can't figure out how to incorporate the other call.
The function:
def specialMultiplication(n):
    if n < 2:
        return 1
    return n * specialMultiplication(n-1) * specialMultiplication(n-2)
If I just had one of them, it would be really easy:
def specialMult(n, mult=1):
    while n > 1:
        (n, mult) = (n-1, n * mult)  # Or n-2 for the second one
    return mult
I just can't figure out how to add the second call in to get the right answer overall. Thanks!
If you don't mind changing the structure of your algorithm a bit more, you can calculate the values in a bottom-up fashion, starting with the smallest values.
def specialMultiplication(max_n):
    a = b = 1
    for n in range(1, max_n+1):
        a, b = b, a*b*n
    return b
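For example (my own quick check), this reproduces the values of the recursive definition:
for n in range(6):
    print(n, specialMultiplication(n))  # prints 1, 1, 2, 6, 48, 1440 for n = 0..5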
Convert the recursion to an iterative function using an auxiliary "todo list":
def specialMultiplication(n):
    to_process = []
    result = 1
    if n >= 2:
        to_process.append(n)
    while to_process:  # while list is not empty
        n = to_process.pop()
        result *= n
        if n >= 3:
            to_process.append(n-1)
        if n >= 4:
            to_process.append(n-2)
    return result
create a work list (to_process)
if n >= 2, add n to the list
while to_process is not empty, pop item from list, multiply to result
if n-1 < 2, don't perform "left" operation (don't append to work list)
if n-2 < 2, don't perform "right" operation (don't append to work list)
This method has the advantage of consuming less stack. I've checked the results against the recursive version for values from 1 to 25 and they were equal.
Note that it's still slow, since the complexity is O(2^n), so it starts to get really slow from n=30 (the time doubles when n increases by 1). n=28 is computed in 12 seconds on my laptop.
I've successfully used this method to fix a stack overflow problem when performing a flood fill algorithm (Fatal Python error: Cannot recover from stack overflow. During Flood Fill), but here Blckknght's answer is better suited because it rethinks the way of computing the value from the start.
The OP's function has the same recursive structure as the Fibonacci and Lucas functions, just with different values for f0, f1, and g:
f(0) = f0
f(1) = f1
f(n) = g(f(n-2), f(n-1), n)
This is an example of a recurrence relation. Here is an iterative version of the general solution that calculates f(n) in n steps. It corresponds to a bottom-up tail recursion.
def f(n):
    if not isinstance(n, int):  # Can be loosened a bit
        raise TypeError('Input must be an int')  # Can be more informative
    if n < 0:
        raise ValueError('Input must be non-negative')
    if n == 0:
        return f0
    i, fi_1, fi = 1, f0, f1  # invariant: fi_1, fi = f(i-1), f(i)
    while i < n:
        i += 1
        fi_1, fi = fi, g(fi_1, fi, i)  # restore invariant for the new i (note: i, not n)
    return fi
Blckknght's answer is a simplified version of this.
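For instance (my own instantiation, not part of the answer above), plugging the OP's recurrence into the generic version:
f0 = f1 = 1  # base cases: f(0) = f(1) = 1

def g(prev2, prev1, i):
    # f(i) = i * f(i-1) * f(i-2)
    return i * prev1 * prev2

print(f(5))  # 1440, matching the recursive specialMultiplication(5) = 5 * 48 * 6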

An algorithm for randomly generating integer partitions of a particular length, in Python?

I've been using the random_element() function provided by SAGE to generate random integer partitions for a given integer (N) that are a particular length (S). I'm trying to generate unbiased random samples from the set of all partitions for given values of N and S. SAGE's function quickly returns random partitions for N (i.e. Partitions(N).random_element()).
However, it slows immensely when adding S (i.e. Partitions(N,length=S).random_element()). Likewise, filtering out random partitions of N that are of length S is incredibly slow.
However, and I hope this helps someone, I've found that when the function returns a partition of N that does not match the length S, the conjugate partition is often of length S. That is:
S = 10
N = 100
part = list(Partitions(N).random_element())
if len(part) != S:
    SAD = list(Partition(part).conjugate())
    if len(SAD) != S:
        continue  # this fragment sits inside a sampling loop, hence the continue
This increases the rate at which partitions of length S are found and appears to produce unbiased samples (I've examined the results against entire sets of partitions for various values of N and S).
However, I'm using values of N (e.g. 10,000) and S (e.g. 300) that make even this approach impractically slow. The comment associated with SAGE's random_element() function admits there is plenty of room for optimization. So, is there a way to more quickly generate unbiased (i.e. random uniform) samples of integer partitions matching given values of N and S, perhaps, by not generating partitions that do not match S? Additionally, using conjugate partitions works well in many cases to produce unbiased samples, but I can't say that I precisely understand why.
Finally, I have a definitively unbiased method that has a zero rejection rate. Of course, I've tested it to make sure the results are representative samples of entire feasible sets. It's very fast and totally unbiased. Enjoy.
from sage.all import *
import random
First, a function to find the smallest maximum addend for a partition of n with s parts
def min_max(n, s):
    _min = int(floor(float(n)/float(s)))
    if int(n % s) > 0:
        _min += 1
    return _min
Next, a function that uses a cache and memoization to find the number of partitions of n with s parts having x as the largest part. This is fast, but I think there's a more elegant solution to be had, e.g., often P(N, S, max=K) = P(N-K, S-1).
Thanks to ante (https://stackoverflow.com/users/494076/ante) for helping me with this:
Finding the number of integer partitions given a total, a number of parts, and a maximum summand
D = {}
def P(n, s, x):
    if n > s*x or x <= 0: return 0
    if n == s*x: return 1
    if (n, s, x) not in D:
        D[(n, s, x)] = sum(P(n-i*x, s-i, x-1) for i in xrange(s))
    return D[(n, s, x)]
Finally, a function to find uniform random partitions of n with s parts, with no rejection rate! Each randomly chosen number codes for a specific partition of n having s parts.
def random_partition(n, s):
    S = s
    partition = []
    _min = min_max(n, S)
    _max = n-S+1
    total = number_of_partitions(n, S)
    which = random.randrange(1, total+1)  # random number
    while n:
        for k in range(_min, _max+1):
            count = P(n, S, k)
            if count >= which:
                count = P(n, S, k-1)
                break
        partition.append(k)
        n -= k
        if n == 0: break
        S -= 1
        which -= count
        _min = min_max(n, S)
        _max = k
    return partition
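A usage sketch (mine; it assumes a Sage session where number_of_partitions(n, S) counts partitions of n into S parts, as the code above relies on):
part = random_partition(100, 10)
print part, sum(part), len(part)   # e.g. [31, 17, 13, ...], 100, 10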
I ran into a similar problem when I was trying to calculate the probability of the strong birthday problem.
First off, the partition function explodes when given only a modest amount of numbers. You'll be returning a LOT of information. No matter which method you're using, N = 10000 and S = 300 will generate ridiculous amounts of data. It will be slow. Chances are any pure Python implementation you use will be equally slow or slower. Look into writing a C module.
If you want to try Python, the approach I took was a combination of itertools and generators to keep memory usage down. I don't seem to have my code handy anymore, but here's a good implementation:
http://wordaligned.org/articles/partitioning-with-python
EDIT:
Found my code:
def partition(a, b=-1, limit=365):
    if (b == -1):
        b = a
    if (a == 2 or a == 3):
        if (b >= a and limit):
            yield [a]
        else:
            return
    elif (a > 3):
        if (a <= b):
            yield [a]
        c = 0
        if b > a-2:
            c = a-2
        else:
            c = b
        for i in xrange(c, 1, -1):
            if (limit):
                for j in partition(a-i, i, limit-1):
                    yield [i] + j
Simple approach: randomly assign the integers:
import random

def random_partition(n, s):
    partition = [0] * s
    for x in range(n):
        partition[random.randrange(s)] += 1
    return partition

Subset sum Problem

Recently I became interested in the subset-sum problem, which is finding a zero-sum subset in a superset. I found some solutions on SO; in addition, I came across a particular solution that uses a dynamic programming approach. I translated that solution into Python based on its qualitative description. I'm trying to optimize this for larger lists, which eat up a lot of my memory. Can someone recommend optimizations or other techniques to solve this particular problem? Here's my attempt in Python:
import random
from time import time
from itertools import product

time0 = time()

# create a zero matrix of size a (row), b(col)
def create_zero_matrix(a, b):
    return [[0]*b for x in xrange(a)]

# generate a list of size num with random integers with an upper and lower bound
def random_ints(num, lower=-1000, upper=1000):
    return [random.randrange(lower, upper+1) for i in range(num)]

# split a list up into N and P where N is the sum of the negative values and P the sum of the positive values.
# 0 does not count because of additive identity
def split_sum(A):
    N_list = []
    P_list = []
    for x in A:
        if x < 0:
            N_list.append(x)
        elif x > 0:
            P_list.append(x)
    return [sum(N_list), sum(P_list)]

# since the column indexes are in the range from 0 to P - N
# we would like to retrieve them based on the index in the range N to P
# n := row, m := col
def get_element(table, n, m, N):
    if n < 0:
        return 0
    try:
        return table[n][m - N]
    except:
        return 0

# same definition as above
def set_element(table, n, m, N, value):
    table[n][m - N] = value

# input array
#A = [1, -3, 2, 4]
A = random_ints(200)

[N, P] = split_sum(A)

# create a zero matrix of size m (row) by n (col)
#
# m := the number of elements in A
# n := P - N + 1 (by definition N <= s <= P)
#
# each element in the matrix will be a value of either 0 (false) or 1 (true)
m = len(A)
n = P - N + 1
table = create_zero_matrix(m, n)

# set first element in index (0, A[0]) to be true
# Definition: Q(1,s) := (x1 == s). Note that index starts at 0 instead of 1.
set_element(table, 0, A[0], N, 1)

# iterate through each table element
#for i in xrange(1, m): #row
#    for s in xrange(N, P + 1): #col
for i, s in product(xrange(1, m), xrange(N, P + 1)):
    if get_element(table, i - 1, s, N) or A[i] == s or get_element(table, i - 1, s - A[i], N):
        #set_element(table, i, s, N, 1)
        table[i][s - N] = 1

# find zero-sum subset solution
s = 0
solution = []
for i in reversed(xrange(0, m)):
    if get_element(table, i - 1, s, N) == 0 and get_element(table, i, s, N) == 1:
        s = s - A[i]
        solution.append(A[i])

print "Solution: ", solution

time1 = time()
print "Time execution: ", time1 - time0
I'm not quite sure if your solution is exact or a PTA (poly-time approximation).
But, as someone pointed out, this problem is indeed NP-Complete.
Meaning, every known (exact) algorithm has an exponential time behavior on the size of the input.
Meaning, if you can process 1 operation in .01 nanosecond then, for a list of 59 elements it'll take:
2^59 ops --> 2^59 / 10,000,000,000 ≈ 2^26 seconds --> 2^26 / (3600 x 24 x 365) ≈ 1 year
You can find heuristics, which give you just a CHANCE of finding an exact solution in polynomial time.
On the other hand, if you restrict the problem (to a different one) by putting bounds on the values of the numbers in the set, then the complexity reduces to polynomial time. But even then the memory consumed will be a polynomial of VERY high order.
The memory consumed will be much larger than the few gigabytes you have in RAM.
And even much larger than the few terabytes on your hard drive.
(That's for small values of the bound on the elements in the set.)
Maybe this is the case with your dynamic programming algorithm.
It seemed to me that you were using a bound of 1000 when building your initialization matrix.
You can try a smaller bound. That is, if your input consistently consists of small values.
Good Luck!
Someone on Hacker News came up with the following solution to the problem, which I quite liked. It just happens to be in python :):
def subset_summing_to_zero(activities):
    subsets = {0: []}
    for (activity, cost) in activities.iteritems():
        old_subsets = subsets
        subsets = {}
        for (prev_sum, subset) in old_subsets.iteritems():
            subsets[prev_sum] = subset
            new_sum = prev_sum + cost
            new_subset = subset + [activity]
            if 0 == new_sum:
                new_subset.sort()
                return new_subset
            else:
                subsets[new_sum] = new_subset
    return []
I spent a few minutes with it and it worked very well.
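For example, with a small dictionary of {name: cost} pairs (my own illustration; Python 2, since the function uses iteritems):
activities = {'a': 5, 'b': -3, 'c': -2, 'd': 7}
print subset_summing_to_zero(activities)   # ['a', 'b', 'c'], since 5 - 3 - 2 == 0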
An interesting article on optimizing python code is available here. Basically the main result is that you should inline your frequent loops, so in your case this would mean instead of calling get_element twice per loop, put the actual code of that function inside the loop in order to avoid the function call overhead.
Hope that helps! Cheers
First thing that catches my eye:
def split_sum(A):
    N_list = 0
    P_list = 0
    for x in A:
        if x < 0:
            N_list += x
        elif x > 0:
            P_list += x
    return [N_list, P_list]
Some advice:
Try to use a 1D list, and use bitarray to reduce the memory footprint to a minimum (http://pypi.python.org/pypi/bitarray), so you only have to change the get/set functions. This should reduce your memory footprint by a factor of at least 64 (an integer stored in a list is a pointer to a typed integer object, so the factor can be around 3*32).
Avoid using try/except; figure out the proper index ranges at the beginning instead. You may find that you gain a lot of speed.
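A minimal sketch of that first suggestion (mine, assuming the bitarray package; it replaces the list-of-lists with one flat bit table indexed as row * width + column, so only get_element and set_element change):
from bitarray import bitarray

width = P - N + 1                  # one column per reachable sum s in [N, P]
table = bitarray(len(A) * width)   # flat 1D bit table instead of a matrix
table.setall(False)

def get_element(table, n, m, N):
    if n < 0 or not (N <= m <= P):  # explicit range check instead of try/except
        return 0
    return int(table[n * width + (m - N)])

def set_element(table, n, m, N, value):
    table[n * width + (m - N)] = bool(value)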
The following code works for Python 3.3+; I have used the itertools module, which has some great methods to use.
from itertools import chain, combinations

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

nums = input("Enter the Elements").strip().split()
inputSum = int(input("Enter the Sum You want"))

for i, combo in enumerate(powerset(nums), 1):
    total = 0
    for num in combo:
        total += int(num)
    if total == inputSum:
        print(combo)
The input/output is as follows:
Enter the Elements 1 2 3 4
Enter the Sum You want 5
('1', '4')
('2', '3')
Just change the values in your set w, make an array x as long as w, and pass the sum for which you want subsets as the last argument to subsetsum, and you will be done (if you want to check with your own values).
def subsetsum(cs, k, r, x, w, d):
    x[k] = 1
    if (cs + w[k] == d):
        for i in range(0, k+1):
            if x[i] == 1:
                print(w[i], end=" ")
        print()
    elif cs + w[k] + w[k+1] <= d:
        subsetsum(cs + w[k], k+1, r - w[k], x, w, d)
    if ((cs + r - w[k] >= d) and (cs + w[k] <= d)):
        x[k] = 0
        subsetsum(cs, k+1, r - w[k], x, w, d)

# driver for the above code
w = [2, 3, 4, 5, 0]
x = [0, 0, 0, 0, 0]
subsetsum(0, 0, sum(w), x, w, 7)
