Recursive function definition in Python

So I am trying to implement a version of Hartree-Fock theory for a band system. Basically, it's a matrix convergence problem. I have a matrix H0, from whose eigenvalues I can construct another matrix F. The procedure is then to define H1 = H0 + F and check whether the eigenvalues of H1 are close to those of H0. If not, I construct a new F from the eigenvalues of H1 and define H2 = H0 + F. Then I check again and iterate.
The problem is fairly generic, and my exact code doesn't seem all that relevant, so I am showing only this:
# define the matrix F
def generate_fock(H):
    def fock(k): #k is a 2D array
        matt = some prefactor*outer(eigenvectors of H(k) with itself) #error1
        return matt
    return fock

k0 = linspace(0,5*kt/2,51)
# H0 is considered defined
H = lambda k: H0(k)
evalold = array([sort(linalg.eigvalsh(H(array([0,2*k-kt/2])))) for k in k0[:]])[the ones I care]
while True:
    fe = generate_fock(H)
    H = lambda k: H0(k)+fe(k) #error2
    evalnew = array([sort(linalg.eigvalsh(H(array([0,2*k-kt/2])))) for k in k0[:]])[the ones I care]
    if allclose(evalnew, evalold): break
    else: evalold = evalnew
I am using inner functions, hoping that Python would not treat my definitions as recursive (I am not sure I am using the word correctly). But Python knows :( Any suggestions?
Edit1:
The error message is highlighting the lines I labeled error1 and error2 and showing the following:
RecursionError: maximum recursion depth exceeded while calling a Python object
I think this comes from the way I define the functions: in iteration n, F(k) depends on the H(k) of the previous iteration, and the H(k) of the next step depends on F(k) again. My question is: how do I get around this?
Edit2&3:
Let me add more details to the code as suggested. This is the shortest thing I can come up with that exactly reproduces my problem.
from numpy import *
from scipy import linalg

# Let's say H0 is any 2m by 2m Hermitian matrix. m = 4 in this case.
# Here are some simplified parameters
def h(i,k):
    return -6*linalg.norm(k)*array([[0,exp(1j*(angle(k@array([1,1j]))+(-1)**i*0.1/2))],
                                    [exp(-1j*(angle(k@array([1,1j]))+(-1)**i*0.1/2)),0]])

T = array([ones([2,2]),
           [[exp(-1j*2*pi/3),1],[exp(1j*2*pi/3),exp(-1j*2*pi/3)]],
           [[exp(1j*2*pi/3),1],[exp(-1j*2*pi/3),exp(1j*2*pi/3)]]])
g = array([[ 0.27023695, 0.46806412], [-0.27023695, 0.46806412]])
kt = linalg.norm(g[0])

def H0(k):
    "one example"
    matt = linalg.block_diag(h(1,k),h(2,k+g[0]),h(2,k+g[1]),h(2,k+g[0]+g[1]))/2
    for j in range(3): matt[0:2,2*j+2:2*j+4] = T[j]
    return array(matrix(matt).getH())+matt

dim = 4

def bz(x):
    "BZ centered at 0 with (2x)^2 points in it"
    tempList = []
    for i in range(-x,x):
        for j in range(-x,x):
            tempList.append(i*g[0]/2/x+j*g[1]/2/x)
    return tempList

def V(i,G):
    "2D Coulomb interaction"
    if linalg.norm(G)==0: return 0
    if i>=dim: t=1
    else: t=0
    return 2*pi/linalg.norm(G)*exp(0.3*linalg.norm(G)*(-1+(-1)**t)/2)

# define the matrix F for some H
def generate_fock(H):
    def fock(k): #k is a 2D array
        matf = zeros([2*dim,2*dim],dtype=complex128)
        for pt in bz(1): #bz is a list of 2D arrays
            matt = zeros([2*dim,2*dim],dtype=complex128)
            eig_vals1, eig_vecs1 = linalg.eigh(H(pt)) #error1
            idx = eig_vals1.argsort()[::]
            vecs1 = eig_vecs1[:,idx][:dim]
            for vec in vecs1:
                matt = matt + outer(conjugate(vec),vec)
            matt = matt.transpose()/len(bz(1))
            for i in range(2*dim):
                for j in range(2*dim):
                    matf[i,j] = V(j-i,pt-k)*matt[i,j] #V is some prefactor
        return matf
    return fock

k0 = linspace(0,5*kt/2,51)
H = lambda k: H0(k)
evalold = array([sort(linalg.eigvalsh(H(array([0,2*k-kt/2])))) for k in k0[:]])[dim-1:dim+1]
while True:
    fe = generate_fock(H)
    H = lambda k: H0(k)+fe(k) #error2
    evalnew = array([sort(linalg.eigvalsh(H(array([0,2*k-kt/2])))) for k in k0[:]])[dim-1:dim+1]
    if allclose(evalnew, evalold): break
    else: evalold = evalnew

The problem is these lines:
while True:
    fe = generate_fock(H)
    H = lambda k: H0(k)+fe(k) #error2
In each iteration you are generating a new function that references the older one, rather than the final output of that older function, so every call has to walk the whole chain of functions. This will also be very slow, since you re-evaluate all the earlier matrices on every call.
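You can reproduce the failure with nothing but a rebound name (a toy sketch, unrelated to the matrices):

# 'f' inside the lambda body is looked up when the lambda is *called*, not
# when it is defined, so after the rebinding every call loops into itself.
f = lambda x: x
f = lambda x: f(x) + 1  # the inner 'f' now refers to this very lambda
f(1)                    # RecursionError: maximum recursion depth exceeded

The same thing happens to fe and H in the loop above.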
What you want to do is keep the outputs of the old function, for example by building a list of results from the prior iteration and then reading from that list instead of calling back into the old function.
Potentially you could even do this with a cache, though it might get huge. Keep a dictionary of inputs to the function and use that. Something like this:
# define the matrix F
def generate_fock(H):
    d = {}
    def fock(k): #k is a 2D array
        if k in d:             # note: k must be hashable here (e.g. a tuple,
            return d[k]        # not a numpy array) for the dict cache to work
        matt = some prefactor*outer(eigenvectors of H(k) with itself) #error1
        d[k] = matt
        return matt
    return fock
Then it should hopefully only have to reference the last version of the function.
EDIT: Give this a try. As well as caching the results, keep an index into a list of functions instead of a direct reference. This should prevent the recursion depth overflow.
hList = []

# define the matrix F
def generate_fock(n):
    d = {}
    def fock(k): #k is a 2D array (it must be hashable, e.g. a tuple, for the cache)
        if k in d:
            return d[k]
        matt = some prefactor*outer(eigenvectors of hList[n](k) with itself) #error1
        d[k] = matt
        return matt
    return fock

k0 = linspace(0,5*kt/2,51)
# H0 is considered defined
hList.append(lambda k: H0(k))
H = hList[0]
evalold = array([sort(linalg.eigvalsh(H(array([0,2*k-kt/2])))) for k in k0[:]])[the ones I care]
n = 0
while True:
    fe = generate_fock(n)
    n += 1
    # bind fe as a default argument: a bare `fe` in the lambda body would be
    # looked up at call time and see whatever the *latest* iteration assigned
    hList.append(lambda k, fe=fe: H0(k)+fe(k)) #error2
    H = hList[-1]
    evalnew = array([sort(linalg.eigvalsh(H(array([0,2*k-kt/2])))) for k in k0[:]])[the ones I care]
    if allclose(evalnew, evalold): break
    else: evalold = evalnew
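If you'd rather not juggle closures at all, you can iterate on matrices instead of functions. Here is a minimal sketch of that structure at a single fixed k, with a hypothetical 2x2 toy H0 and Fock term standing in for the real ones:

import numpy as np

def H0(k):                         # hypothetical toy Hamiltonian
    return np.array([[0.0, k], [k, 1.0]])

def fock_from(Hk):                 # hypothetical toy Fock term built from Hk
    vals, vecs = np.linalg.eigh(Hk)
    v = vecs[:, 0]                 # "occupied" eigenvector
    return 0.1 * np.outer(np.conjugate(v), v)

k = 0.3
Hk = H0(k)                         # a plain array: no chain of functions forms
evalold = np.linalg.eigvalsh(Hk)
while True:
    Hk = H0(k) + fock_from(Hk)     # new matrix built from the old matrix
    evalnew = np.linalg.eigvalsh(Hk)
    if np.allclose(evalnew, evalold):
        break
    evalold = evalnew

For the real problem you would keep one such array per k-point (e.g. in a dict keyed by tuple(k)) instead of a single Hk; harder cases may also need a mixing/damping step to converge.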

Related

Backtracking not trying all possibilities

So I've got a list of questions as a dictionary, e.g.
{"Question1": 3, "Question2": 5, ... }
That means "Question1" is worth 3 points, the second one 5, etc.
I'm trying to create all subsets of questions that have between a certain number of questions and a certain number of points.
I've tried something like
questions = {"Q1":1, "Q2":2, "Q3": 1, "Q4" : 3, "Q5" : 1, "Q6" : 2}
u = 3 #
v = 5 # between u and v questions
x = 5 #
y = 10 #between x and y points
solution = []
n = 0

def main(n_):
    global n
    n = n_
    global solution
    solution = []
    finalSolution = []
    for x in questions.keys():
        solution.append("_")
    finalSolution.extend(Backtracking(0))
    return finalSolution

def Backtracking(k):
    finalSolution = []
    for c in questions.keys():
        solution[k] = c
        print ("candidate: ", solution)
        if not reject(k):
            print ("not rejected: ", solution)
            if accept(k):
                finalSolution.append(list(solution))
            else:
                finalSolution.extend(Backtracking(k+1))
    return finalSolution

def reject(k):
    if solution[k] in solution: #if the question already exists
        return True
    if k > v: #too many questions
        return True
    points = 0
    for x in solution:
        if x in questions.keys():
            points = points + questions[x]
    if points > y: #too many points
        return True
    return False

def accept(k):
    points = 0
    for x in solution:
        if x in questions.keys():
            points = points + questions[x]
    if points in range (x, y+1) and k in range (u, v+1):
        return True
    return False

print(main(len(questions.keys())))
but it's not trying all possibilities; it only ever puts the questions at the first index.
I have no idea what I'm doing wrong.
There are three problems with your code.
The first issue is that the first check in your reject function is always True: solution[k] is, by definition, an element of solution. You can fix that in a variety of ways (you commented that you're now using solution.count(solution[k]) != 1).
The second issue is that your accept function uses the variable name x for what it intends to be two different things: a question from solution in the for loop, and the global x that is the minimum number of points. That doesn't work, and you'll get a TypeError when trying to pass it to range. A simple fix is to rename the loop variable (I suggest q, since it's a key into questions). Checking whether a value is in a range is also a bit awkward; it's usually much nicer to use chained comparisons: if x <= points <= y and u <= k <= v
The third issue is that you're not backtracking at all. The backtracking step needs to reset the global solution list to the state it had before Backtracking was called. You can do this at the end of the function, just before you return, using solution[k] = "_" (you commented that you've added this line, but I think you put it in the wrong place).
Anyway, here's a fixed version of your functions:
def Backtracking(k):
    finalSolution = []
    for c in questions.keys():
        solution[k] = c
        print ("candidate: ", solution)
        if not reject(k):
            print ("not rejected: ", solution)
            if accept(k):
                finalSolution.append(list(solution))
            else:
                finalSolution.extend(Backtracking(k+1))
    solution[k] = "_" # backtracking step here!
    return finalSolution

def reject(k):
    if solution.count(solution[k]) != 1: # fix this condition
        return True
    if k > v:
        return True
    points = 0
    for q in solution:
        if q in questions:
            points = points + questions[q]
    if points > y: #too many points
        return True
    return False

def accept(k):
    points = 0
    for q in solution: # change this loop variable (also done above, for symmetry)
        if q in questions:
            points = points + questions[q]
    if x <= points <= y and u <= k <= v: # chained comparisons are much nicer than range
        return True
    return False
There are still things that could be improved in there. I think having solution be a fixed-size global list with dummy values is especially unpythonic (a dynamically growing list that you pass as an argument would be much more natural). I'd also suggest using sum to add up the points rather than an explicit loop of your own.
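For example, both point totals could come from one small helper (a sketch reusing the names from the code above):

def total_points():
    # total points of the questions currently placed in `solution`
    return sum(questions[q] for q in solution if q in questions)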

Python - speed up pathfinding

This is my pathfinding function:
def get_distance(x1,y1,x2,y2):
    neighbors = [(-1,0),(1,0),(0,-1),(0,1)]
    old_nodes = [(square_pos[x1,y1],0)]
    new_nodes = []
    for i in range(50):
        for node in old_nodes:
            if node[0].x == x2 and node[0].y == y2:
                return node[1]
            for neighbor in neighbors:
                try:
                    square = square_pos[node[0].x+neighbor[0],node[0].y+neighbor[1]]
                    if square.lightcycle == None:
                        new_nodes.append((square,node[1]))
                except KeyError:
                    pass
        old_nodes = []
        old_nodes = list(new_nodes)
        new_nodes = []
    nodes = []
    return 50
The problem is that the AI takes too long to respond (the response time needs to be <= 100 ms).
This is just a Python version of https://en.wikipedia.org/wiki/Pathfinding#Sample_algorithm
You should replace your algorithm with A* search, using the Manhattan distance as the heuristic.
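For reference, here is a minimal sketch of grid A* with that heuristic; passable(x, y) is a hypothetical stand-in for the game's own walkability test (the square.lightcycle check above):

import heapq

def astar(start, goal, passable):
    # Manhattan-distance heuristic: admissible on a 4-connected grid
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_heap = [(h(start), 0, start)]    # (f = g + h, g, position)
    best_g = {start: 0}
    while open_heap:
        f, g, pos = heapq.heappop(open_heap)
        if pos == goal:
            return g                      # length of a shortest path
        if g > best_g.get(pos, float('inf')):
            continue                      # stale heap entry, skip it
        for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (pos[0] + dx, pos[1] + dy)
            ng = g + 1
            if passable(*nxt) and ng < best_g.get(nxt, float('inf')):
                best_g[nxt] = ng
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None                           # goal unreachable

Because the heuristic never overestimates, A* expands far fewer squares than plain breadth-first search on open maps.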
One reasonably fast solution is to implement the Dijkstra algorithm (which I have already implemented in that question):
Build the original map. It's a masked array where the walker cannot step on a masked element:
%pylab inline
map_size = (20,20)
MAP = np.ma.masked_array(np.zeros(map_size), np.random.choice([0,1], size=map_size))
matshow(MAP)
Below is the Dijkstra algorithm:
def dijkstra(V):
    mask = V.mask
    visit_mask = mask.copy() # mask visited cells
    m = numpy.ones_like(V) * numpy.inf
    connectivity = [(i,j) for i in [-1, 0, 1] for j in [-1, 0, 1] if (not (i == j == 0))]
    cc = unravel_index(V.argmin(), m.shape) # current_cell
    m[cc] = 0
    P = {} # dictionary of predecessors
    #while (~visit_mask).sum() > 0:
    for _ in range(V.size):
        #print cc
        neighbors = [tuple(e) for e in asarray(cc) - connectivity
                     if e[0] > 0 and e[1] > 0 and e[0] < V.shape[0] and e[1] < V.shape[1]]
        neighbors = [ e for e in neighbors if not visit_mask[e] ]
        tentative_distance = [(V[e]-V[cc])**2 for e in neighbors]
        for i,e in enumerate(neighbors):
            d = tentative_distance[i] + m[cc]
            if d < m[e]:
                m[e] = d
                P[e] = cc
        visit_mask[cc] = True
        m_mask = ma.masked_array(m, visit_mask)
        cc = unravel_index(m_mask.argmin(), m.shape)
    return m, P

def shortestPath(start, end, P):
    Path = []
    step = end
    while 1:
        Path.append(step)
        if step == start: break
        if P.has_key(step):
            step = P[step]
        else:
            break
    Path.reverse()
    return asarray(Path)
And the result:
start = (2,8)
stop = (17,19)
D, P = dijkstra(MAP)
path = shortestPath(start, stop, P)
imshow(MAP, interpolation='nearest')
plot(path[:,1], path[:,0], 'ro-', linewidth=2.5)
Below are some timing statistics:
%timeit dijkstra(MAP)
#10 loops, best of 3: 32.6 ms per loop
The biggest issue with your code is that you don't do anything to avoid the same coordinates being visited multiple times. This means that the number of nodes you visit is guaranteed to grow exponentially, since it can keep going back and forth over the first few nodes many times.
The best way to avoid duplication is to maintain a set of the coordinates we've added to the queue (though if your node values are hashable, you might be able to add them directly to the set instead of coordinate tuples). Since we're doing a breadth-first search, we'll always reach a given coordinate by (one of) the shortest path(s), so we never need to worry about finding a better route later on.
Try something like this:
def get_distance(x1,y1,x2,y2):
    neighbors = [(-1,0),(1,0),(0,-1),(0,1)]
    nodes = [(square_pos[x1,y1],0)]
    seen = set([(x1, y1)])
    for node, path_length in nodes:
        if path_length == 50:
            break
        if node.x == x2 and node.y == y2:
            return path_length
        for nx, ny in neighbors:
            try:
                square = square_pos[node.x + nx, node.y + ny]
                if square.lightcycle == None and (square.x, square.y) not in seen:
                    nodes.append((square, path_length + 1))
                    seen.add((square.x, square.y))
            except KeyError:
                pass
    return 50
I've also simplified the loop a bit. Rather than switching out the list after each depth, you can use one loop and append to its end as you iterate over the earlier values. I still abort if a path hasn't been found within 50 steps (using the distance stored in the 2-tuple rather than the number of passes of the outer loop). A further improvement might be to use a collections.deque for the queue, since you could efficiently pop from one end while appending to the other. It probably won't make a huge difference, but it might save a little memory.
I also avoided most of the indexing by one and zero in favor of unpacking into separate variable names in the for loops. I think this is much easier to read, and it avoids confusion, since the two different kinds of 2-tuples had different meanings (one is a (node, distance) tuple, the other is (x, y)).
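For completeness, here is what the deque variant mentioned above would look like; square_pos and the square objects are assumed to behave exactly as in the original code:

from collections import deque

def get_distance(x1, y1, x2, y2):
    neighbors = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    queue = deque([(square_pos[x1, y1], 0)])
    seen = set([(x1, y1)])
    while queue:
        node, path_length = queue.popleft()   # O(1) pop from the front
        if path_length == 50:
            break
        if node.x == x2 and node.y == y2:
            return path_length
        for nx, ny in neighbors:
            try:
                square = square_pos[node.x + nx, node.y + ny]
                if square.lightcycle == None and (square.x, square.y) not in seen:
                    queue.append((square, path_length + 1))
                    seen.add((square.x, square.y))
            except KeyError:
                pass
    return 50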

Motif search with Gibbs sampler

I am a beginner in both programming and bioinformatics, so I would appreciate your understanding. I tried to develop a Python script for motif search using Gibbs sampling, as explained in the Coursera class "Finding Hidden Messages in DNA". The pseudocode provided in the course is:
GIBBSSAMPLER(Dna, k, t, N)
    randomly select k-mers Motifs = (Motif_1, …, Motif_t) in each string from Dna
    BestMotifs ← Motifs
    for j ← 1 to N
        i ← Random(t)
        Profile ← profile matrix constructed from all strings in Motifs except for Motif_i
        Motif_i ← Profile-randomly generated k-mer in the i-th sequence
        if Score(Motifs) < Score(BestMotifs)
            BestMotifs ← Motifs
    return BestMotifs
Problem description:
CODE CHALLENGE: Implement GIBBSSAMPLER.
Input: Integers k, t, and N, followed by a collection of strings Dna.
Output: The strings BestMotifs resulting from running GIBBSSAMPLER(Dna, k, t, N) with
20 random starts. Remember to use pseudocounts!
Sample Input:
8 5 100
CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA
GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG
TAGTACCGAGACCGAAAGAAGTATACAGGCGT
TAGATCAAGTTTCAGGTGCACGTCGGTGAACC
AATCCACCAGCTCCACGTGCAATGTTGGCCTA
Sample Output:
TCTCGGGG
CCAAGGTG
TACAGGCG
TTCAGGTG
TCCACGTG
I followed the pseudocode to the best of my knowledge. Here is my code:
def BuildProfileMatrix(dnamatrix):
    ProfileMatrix = [[1 for x in xrange(len(dnamatrix[0]))] for x in xrange(4)]
    indices = {'A':0, 'C':1, 'G': 2, 'T':3}
    for seq in dnamatrix:
        for i in xrange(len(dnamatrix[0])):
            ProfileMatrix[indices[seq[i]]][i] += 1
    ProbMatrix = [[float(x)/sum(zip(*ProfileMatrix)[0]) for x in y] for y in ProfileMatrix]
    return ProbMatrix

def ProfileRandomGenerator(profile, dna, k, i):
    indices = {'A':0, 'C':1, 'G': 2, 'T':3}
    score_list = []
    for x in xrange(len(dna[i]) - k + 1):
        probability = 1
        window = dna[i][x : k + x]
        for y in xrange(k):
            probability *= profile[indices[window[y]]][y]
        score_list.append(probability)
    rnd = uniform(0, sum(score_list))
    current = 0
    for z, bias in enumerate(score_list):
        current += bias
        if rnd <= current:
            return dna[i][z : k + z]

def score(motifs):
    ProfileMatrix = [[0 for x in xrange(len(motifs[0]))] for x in xrange(4)]
    indices = {'A':0, 'C':1, 'G': 2, 'T':3}
    for seq in motifs:
        for i in xrange(len(motifs[0])):
            ProfileMatrix[indices[seq[i]]][i] += 1
    score = len(motifs)*len(motifs[0]) - sum([max(x) for x in zip(*ProfileMatrix)])
    return score

from random import randint, uniform

def GibbsSampler(k, t, N):
    dna = ['CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA',
           'GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG',
           'TAGTACCGAGACCGAAAGAAGTATACAGGCGT',
           'TAGATCAAGTTTCAGGTGCACGTCGGTGAACC',
           'AATCCACCAGCTCCACGTGCAATGTTGGCCTA']
    Motifs = []
    for i in [randint(0, len(dna[0])-k) for x in range(len(dna))]:
        j = 0
        kmer = dna[j][i : k+i]
        j += 1
        Motifs.append(kmer)
    BestMotifs = []
    s_best = float('inf')
    for i in xrange(N):
        x = randint(0, t-1)
        Motifs.pop(x)
        profile = BuildProfileMatrix(Motifs)
        Motif = ProfileRandomGenerator(profile, dna, k, x)
        Motifs.append(Motif)
        s_motifs = score(Motifs)
        if s_motifs < s_best:
            s_best = s_motifs
            BestMotifs = Motifs
    return [s_best, BestMotifs]

k, t, N = 8, 5, 100
best_motifs = [float('inf'), None]
# Repeat the Gibbs sampler search 20 times.
for repeat in xrange(20):
    current_motifs = GibbsSampler(k, t, N)
    if current_motifs[0] < best_motifs[0]:
        best_motifs = current_motifs
# Print and save the answer.
print '\n'.join(best_motifs[1])
Unfortunately, my code never gives the same output as the solved example. Besides, while trying to debug the code, I found that I get weird scores for the mismatches between motifs. However, when I tried to run the score function separately, it worked perfectly.
Each time I run the script, the output changes, but anyway here is an example of one of the outputs for the input present in the code:
Example output of my code
TATGTGTA
TATGTGTA
TATGTGTA
GGTGTTCA
TATACAGG
Could you please help me debug this code?! I spent the whole day trying to find out what's wrong with it. I know it might be some silly mistake I made, but my eye failed to catch it.
Thank you all!!
Finally, I found out what was wrong in my code! It was in line 54:
Motifs.append(Motif)
After randomly removing one of the motifs, building a profile out of the remaining motifs, and then randomly selecting a new motif based on this profile, I should have put the selected motif back in the position it was removed from, NOT appended it to the end of the motif list.
The correct line is:
Motifs.insert(x, Motif)
With that change, the code works as expected.
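In terms of the names used above, the update step inside the N-loop should read like this, so Motifs[i] stays aligned with dna[i]:

x = randint(0, t-1)
Motifs.pop(x)                         # drop the motif for sequence x
profile = BuildProfileMatrix(Motifs)  # profile from the remaining t-1 motifs
Motif = ProfileRandomGenerator(profile, dna, k, x)
Motifs.insert(x, Motif)               # back into slot x, not appended at the end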

How to optimize python dynamic programming knapsack (multiprocessing?)

I've solved a problem on SPOJ, but it's still too slow to be accepted.
I've tried to make it use multiprocessing too, but I failed: it's still slower.
The basic implementation, even with PyPy, gets "time limit exceeded" on SPOJ.
So, how can I improve it?
And what is wrong with the multiprocessing implementation?
# -- shipyard
from collections import defaultdict
#W = 100 total weight
#N = 2 number of types
#value | weight
#1 1
#30 50
# result -> 60 = minimum total value
#c = [1, 30]
#w = [1, 50]

def knap(W, N, c, w):
    f = defaultdict(int)
    g = defaultdict(bool)
    g[0] = True
    for i in xrange(N):
        for j in xrange(W):
            if g[j]:
                g[j+w[i]] = True
                #print "g("+str(j+w[i])+") = true"
                if ( f[j+w[i]]==0 or f[j+w[i]]>f[j]+c[i]):
                    f[j+w[i]] = f[j]+c[i]
                    #print " f("+str(j+w[i])+") = ",f[j+w[i]]
    if g[W]:
        print f[W]
    else:
        print -1

def start():
    while True:
        num_test = int(raw_input())
        for i in range(num_test):
            totWeight = int(raw_input())
            types = int(raw_input())
            costs = defaultdict(int)
            weights = defaultdict(int)
            for t in range(int( types )):
                costs[t], weights[t] = [int(i) for i in raw_input().split()]
            knap(totWeight, types, costs, weights)
        return

if __name__ == '__main__':
    start()
And here is the multiprocessing version:
# -- shipyard
from multiprocessing import Process, Queue
from collections import defaultdict
from itertools import chain

W = 0
c = {} #[]
w = {} #[]

def knap(i, g, f, W, w, c, qG, qF):
    for j in xrange( W ):
        if g[j]:
            g[j+w[i]] = True
            #print "g("+str(j+w[i])+") = true"
            if ( f[j+w[i]]==0 or f[j+w[i]]>f[j]+c[i]):
                f[j+w[i]] = f[j]+c[i]
                #print " f("+str(j+w[i])+") = ",f[j+w[i]]
    qG.put( g)
    qF.put( f)

def start():
    global f, g, c, w, W
    while True:
        num_test = int(raw_input())
        for _ in range(num_test):
            qG = Queue()
            qF = Queue()
            W = int(raw_input())
            N = int(raw_input()) # types
            c = {} #[0 for i in range(N)]
            w = {} #[0 for i in range(N)]
            f = defaultdict(int)
            g = defaultdict(bool)
            g[0] = True
            for t in range( N ):
                c[t], w[t] = [int(i) for i in raw_input().split()]
            # let's go parallel
            for i in xrange(0, N, 2):
                k1 = Process(target=knap, args=(i, g, f, W, w, c, qG, qF))
                k2 = Process(target=knap, args=(i+1, g, f, W, w, c, qG, qF))
                k1.start()
                k2.start()
                k1.join()
                k2.join()
                #while k1.is_alive(): # or k2.is_alive():
                #    None
                #g2 = defaultdict(bool, chain( g.iteritems(), qG.get().iteritems(), qG.get().iteritems()))
                #f2 = defaultdict(int, chain( f.iteritems(), qF.get().iteritems(), qF.get().iteritems()))
                g2 = defaultdict(bool, g.items()+ qG.get().items()+ qG.get().items())
                f2 = defaultdict(int, f.items()+ qF.get().items()+ qF.get().items())
                g = g2
                f = f2
            print "\n g: ", len(g), "\n f: ", len(f),"\n"
            if g[W]:
                print f[W]
            else:
                print -1
        return

if __name__ == '__main__':
    start()
I probably haven't understood how to make two processes work efficiently on the same dictionary.
Some programming contests explicitly ban or block multithreading, so look for improvements elsewhere. My approach in such cases is to use a profiling tool to see where the code is struggling. You could try the built-in cProfile (python -m cProfile -o <outputfilename> <script-name> <options>) and then this wonderful visualization tool: http://www.vrplumber.com/programming/runsnakerun/
Once you have your visualization, look around and dig into the boxes. Sometimes things that are not directly evident make sense once you inspect the running times. E.g. a common problem (not sure if it's your case) is checking for list membership: it's much faster to use a set, and it often pays to keep a separate list and set if you need the ordering later. There are many more tips regarding importing names into the local namespace, etc. You can check a list here: https://wiki.python.org/moin/PythonSpeed/PerformanceTips
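The list-versus-set membership point is easy to check for yourself (a generic sketch, not specific to this program):

from timeit import timeit

items_list = range(100000)   # a plain Python 2 list, as in the code above
items_set = set(items_list)

print timeit(lambda: 99999 in items_list, number=100)  # O(n): scans the list
print timeit(lambda: 99999 in items_set, number=100)   # O(1) average: hash lookup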
Many people who use Python face the same problem on programming contest sites. I've found that it's best to simply ditch Python altogether for problems which take large inputs and make you construct and iterate over a big data structure: re-implement the same solution in C or C++. Python is known to be 10 to 100 times slower than well-optimised C/C++ code.
Your code looks good, and there's really little you can do to gain more speed (apart from big-O improvements, or jumping through hoops such as multiprocessing). If you have to use Python, avoid creating unnecessary large lists and use the most efficient algorithm you can think of. You can also generate and run large test cases prior to submitting your solution.

Query long lists

I would like to query the value of an exponentially weighted moving average at particular points. An inefficient way to do this is as follows: l is the list of times of events, and queries holds the times at which I want the value of this average.
a = 0.01
l = [3,7,10,20,200]
y = [0]*1000
for item in l:
    y[int(item)] = 1
s = [0]*1000
for i in xrange(1,1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]
queries = [23,68,103]
for q in queries:
    print s[q]
Outputs:
0.0355271185019
0.0226018371526
0.0158992102478
In practice l will be very large, and the range of values in l will also be huge. How can you find the values at the times in queries more efficiently, and in particular without computing the potentially huge lists y and s explicitly? I need it in pure Python so I can use PyPy.
Is it possible to solve the problem in time proportional to len(l) and not max(l) (assuming len(queries) < len(l))?
Here is my code for doing this:
def ewma(l, queries, a=0.01):
def decay(t0, x, t1, a):
from math import pow
return pow((1-a), (t1-t0))*x
assert l == sorted(l)
assert queries == sorted(queries)
samples = []
try:
t0, x0 = (0.0, 0.0)
it = iter(queries)
q = it.next()-1.0
for t1 in l:
# new value is decayed previous value, plus a
x1 = decay(t0, x0, t1, a) + a
# take care of all queries between t0 and t1
while q < t1:
samples.append(decay(t0, x0, q, a))
q = it.next()-1.0
# take care of all queries equal to t1
while q == t1:
samples.append(x1)
q = it.next()-1.0
# update t0, x0
t0, x0 = t1, x1
# take care of any remaining queries
while True:
samples.append(decay(t0, x0, q, a))
q = it.next()-1.0
except StopIteration:
return samples
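Called on the numbers from the question, it should reproduce the three s[q] values, e.g.:

print ewma([3, 7, 10, 20, 200], [23, 68, 103])
# expected: values matching s[23], s[68], s[103] from the question's loop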
I've also uploaded a fuller version of this code, with unit tests and some comments, to pastebin: http://pastebin.com/shhaz710
EDIT: Note that this does the same thing as what Chris Pak suggests in his answer, which he must have posted as I was typing this. I haven't gone through the details of his code, but I think mine is a bit more general: it supports non-integer values in l and queries, and it works on any kind of iterable, not just lists, since I don't do any indexing.
I think you could do it in O(lg len(l)) time per query, if l is sorted. The key is the non-recursive form of the EMA: unrolling the recurrence s[i] = a*y[i-1] + (1-a)*s[i-1] gives
s[k] = a*y[k-1] + a*(1-a)*y[k-2] + a*(1-a)^2*y[k-3] + ...
so only the event times contribute, each weighted by a power of (1-a). This means that for query k, you binary-search for the greatest time in l less than k, then walk backwards accumulating terms of the form (1-a)^(k-t)*v[t] until they drop below your estimation limit. You spend lg(len(l)) time in the search, plus a constant multiple of the depth of your estimation. I'll provide a code sample in a little bit (after work) if you want it; I just wanted to get the idea out there while I was thinking about it.
Here's the code.
v is the dictionary of values at each event time; replace v[...] with 1 if the value is just 1 every time...
import math
from bisect import bisect_right

a = .01
limit = 1000
l = [1, 5, 14, 29]  # ... more event times
# v: dictionary mapping event time -> value (see note above)

def find_nearest_lt(l, time):
    # index of the greatest element of l that is <= time
    i = bisect_right(l, time)
    if i:
        return i-1
    raise ValueError

def find_ema(l, time):
    i = find_nearest_lt(l, time)
    if l[i] == time:
        result = a * v[l[i]]
        i -= 1
    else:
        result = 0
    while i >= 0 and (time-l[i]) < limit:
        result += math.pow(1-a, time-l[i]) * v[l[i]]
        i -= 1
    return result
If I'm thinking correctly, the find-nearest step is O(lg n), and the while loop is <= 1000 iterations guaranteed, so it's technically constant time (though a kind of large constant). find_nearest was stolen from the page on bisect - http://docs.python.org/2/library/bisect.html
It appears that y is a binary value -- either 0 or 1 -- depending on the values of l. Why not use y = set(int(item) for item in l)? That's the most efficient way to store and look up a list of numbers.
Your code will cause an error the first time through this loop:
s = [0]*1000
for i in xrange(1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]
because i-1 is -1 when i == 0 (the first pass of the loop), and both y[-1] and s[-1] are the last element of the list, not the previous one. Maybe you want xrange(1,1000)?
How about this code:
a = 0.01
l = [3.0,7.0,10.0,20.0,200.0]
y = set(int(item) for item in l)
queries = [23,68,103]
ewma = []
x = 1 if (0 in y) else 0
for i in xrange(1, queries[-1]+1):   # +1 so the final query time is reached
    x = (1-a)*x
    if i in y:
        x += a
    if i == queries[0]:
        ewma.append(x)
        queries.pop(0)
Edited to include SchighSchagh's improvements.
