Coding an autocorrelation function correctly in Python

I have two data sets, each of which is 325 elements long. One is data along the x-axis, the other is data along the y-axis. Together, Q[0] and Q[1] make up Q.
The equation should be something like:
(Q[i + delta_i, j] - mean(Q)) * (Q[i, j] - mean(Q)) / (Q[i, j] - mean(Q))^2
which is an autocorrelation function. I want to iterate over i + delta_i in the first factor while looping through the data in Q. I started writing two for loops for this, then tried np.correlate. In the end, being new to Python, I got confused and feel like I have spent the last few days chasing my tail.
The code I tried is:
import numpy
import matplotlib.pyplot as plt

I = npzfile['i']
Q = npzfile['q']
U = npzfile['u']
nxside = Q.shape[0]
nyside = Q.shape[1]
for i in numpy.arange(0, nxside - 1, 1):
    result = numpy.correlate(Q[0], Q, mode='full')
    result2 = (result / float(result.max()))
    plt.plot(result2, '-')
for i in numpy.arange(0, nxside - 1, 1):
    for j in numpy.arange(0, nyside - 1, 1):
        M = (Q - numpy.mean(Q))
        B = (Q[i] - numpy.mean(Q))
        K = (Q - numpy.mean(Q))**2
        Z = numpy.abs(M * B) / numpy.abs(K)
        plt.plot(Z, 'ro')
I'd expect the output to be something that varies between 0 and 1 on the y-axis, over 325 elements.
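No answer to this question made it into this page, so here is a rough sketch of mine (not from the original thread) of a normalized 1-D autocorrelation computed directly from the formula above; the sinusoidal `row` is hypothetical stand-in data for one 325-element row of Q:
import numpy as np
import matplotlib.pyplot as plt

def autocorr(x):
    # C(delta) = sum_i d[i+delta]*d[i] / sum_i d[i]**2, with d = x - mean(x),
    # so C(0) == 1 and |C(delta)| <= 1 for every lag.
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    denom = np.sum(d * d)
    return np.array([np.sum(d[delta:] * d[:len(d) - delta]) / denom
                     for delta in range(len(d))])

# hypothetical stand-in for one row of Q (325 elements)
row = np.sin(np.linspace(0, 10 * np.pi, 325))
plt.plot(autocorr(row), '-')
plt.show()
np.correlate(d, d, mode='full') returns the same lagged products unnormalized, which is probably where the np.correlate attempt above was heading.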

Related

Delete consecutive elements in a pandas DataFrame given a certain rule?

I have a variable with zeros and ones. Each sequence of ones represents "a phase" I want to observe, and each sequence of zeros represents the space/distance between these phases.
It may happen that a phase carries a sort of "impulse response", for example the echo of a voice: in this case we will have 1,1,1,1,0,0,1,1,1,0,0,0 as an output, where the first sequence of ones is the shout we made, while the second one is just the echo caused by the shout.
So I made a function that ignores the echoes/responses of the main shout/action and converts the ones of the echo/response into zeros.
(1) If the sequence of zeros is greater than or equal to the input threshold nearby_thr, the function recognizes the following sequence of ones as an independent phase and doesn't delete or change anything.
(2) If the sequence of zeros (between two sequences of ones) is smaller than the input threshold nearby_thr, the function recognizes it as "an impulse response/echo" and does not take it into account; in fact, it converts those ones into zeros.
I made a naive function that accomplishes this, but I was wondering whether pandas already has a function like that, or whether it can be done in a few lines, without writing a "C-like" function.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
# import utili_funzioni.util00 as ut0
x1 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
# rule = x1==1 ## counting number of consecutive ones
# cumsum_ones = rule.cumsum() - rule.cumsum().where(~rule).ffill().fillna(0).astype(int)
def detect_nearby_el_2(df, nearby_thr):
    global el2del
    j = 0
    enterOnce_if = 1
    reset_count_0s = 0
    start2detect = False
    count0s = 0  # init
    start2_getidxs = False  # if this is not True, idxs to delete won't be stored
    el2del = []  # store idxs of elements to delete
    for i in range(df.shape[0]):
        x_i = df.iloc[i, 0]
        if x_i == 1 and j == 0:  # first phase (ones) has been detected
            start2detect = True
        if start2detect:  # first phase has been seen/detected
            if x_i == 0:  # first phase detected and ended with a zero
                if reset_count_0s == 1:
                    count0s = 0
                    reset_count_0s = 0
                count0s += 1
                if enterOnce_if == 1:
                    start2_getidxs = True  # avoid deleting the first phase
                    enterOnce_if = 0
        if start2_getidxs:  # avoid deleting the first phase
            if x_i == 1 and count0s < nearby_thr:
                print("this is NOT a new phase!")
                el2del = [*el2del, i]  # idxs to delete
                reset_count_0s = 1  # reset counter
            if x_i == 1 and count0s >= nearby_thr:
                print("this is a new phase!")  # nothing to delete
                reset_count_0s = 1  # reset counter
    return el2del

def convert_nearby_el_into_zeros(df, idx):
    df0 = df + 0  # work on a copy so the original dataframe is not modified
    if len(idx) > 0:
        # df.drop(df.index[idx])  # to delete the rows completely
        df0.iloc[idx] = 0
    else:
        print("no elements nearby to delete!!")
    return df0
######
print("")
x1_2del = detect_nearby_el_2(df=x1,nearby_thr=3)
x2_2del = detect_nearby_el_2(df=x2,nearby_thr=3)
## deleting nearby elements
x1_a = convert_nearby_el_into_zeros(df=x1,idx=x1_2del)
x2_a = convert_nearby_el_into_zeros(df=x2,idx=x2_2del)
## PLOTTING
# ut0.grayplt()
fig1 = plt.figure()
fig1.suptitle("x1",fontsize=20)
ax1 = fig1.add_subplot(1,2,1)
ax2 = fig1.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x1)
line2, = ax2.plot(x1_a)
fig2 = plt.figure()
fig2.suptitle("x2",fontsize=20)
ax1 = fig2.add_subplot(1,2,1)
ax2 = fig2.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x2)
line2, = ax2.plot(x2_a)
You can see that x1 has two "responses/echoes" that I want to ignore, while x2 has none; in fact, nothing changed in x2.
My question is: how can this be accomplished in a few lines using pandas?
Thank you.
Interesting problem, and I'm sure there's a more elegant solution out there, but here is my attempt - it's at least fairly performant:
x1 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])

def remove_echos(series, threshold):
    starting_points = (series == 1) & (series.shift() == 0)
    echo_starting_points = starting_points & (series.shift(threshold) == 1)
    echo_starting_points = series[echo_starting_points].index
    change_points = series[starting_points].index.to_list() + [series.index[-1] + 1]
    for (start, end) in zip(change_points, change_points[1:]):
        if start in echo_starting_points:
            series.loc[start:end - 1] = 0
    return series

x1 = remove_echos(x1, 3)
x2 = remove_echos(x2, 3)
(I changed x1 and x2 to be Series instead of DataFrames; it's easy to adapt this code to work with a df if you need to.)
Explanation: we define a "starting point" of each section as a 1 preceded by a 0. Of those, we flag a starting point as an "echo" if the point threshold places before it is a 1. (The assumption is that no phase is shorter than threshold.) For each echo starting point, we zero from it up to the next starting point or the end of the Series.
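For reference, here is another possible route (my sketch, not from the original thread): run-length encode the series with cumsum/groupby and zero every run of ones whose preceding gap of zeros is shorter than the threshold. Applied to the original x1 and x2 data it should give the same result as remove_echos above.
import pandas as pd

def remove_echos_rle(series, threshold):
    # Label each run of consecutive equal values with an integer id.
    run_id = (series != series.shift()).cumsum()
    out = series.copy()
    prev_value, prev_size, seen_phase = None, None, False
    for rid, run in series.groupby(run_id):
        value, size = run.iloc[0], len(run)
        if value == 1:
            if seen_phase and prev_value == 0 and prev_size < threshold:
                out[run_id == rid] = 0  # echo: the zero-gap before it is too short
            seen_phase = True
        prev_value, prev_size = value, size
    return out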

Python List in List trouble

Before I start describing my problem, sorry for my grammar; my English is not very good. I'm a Python learner. Today I was working on a project and ran into trouble. I'm trying to make a loop.
coordinates = [[1,2],[2,3],[3,5],[5,6],[7,7],[1,2]]
Here is my list. I'm trying to create a loop that subtracts each first value from the next first value, and each second value from the next second value, then prints the results. Let me explain more simply: given [[x,y],[x1,y1],[x2,y2]], I need to print x1-x together with y1-y, then x2-x1 together with y2-y1, and so on, so the console output should look like this:
1,1
1,2
2,1...
The method I've tried:
while True:
    for x, y in coordinates:
        x = x - y
        print(x)
This did not work because it subtracts the y values from the x values. I know it's quite wrong.
I've researched on the internet but did not understand this subject very well.
I'm looking for help. Thanks, everyone.
A simple and naive implementation:
def pr(arr):
    i = 1
    while i < len(arr):
        (x, y) = arr[i]
        (a, b) = arr[i - 1]
        print(x - a, y - b)
        i += 1

if __name__ == '__main__':
    coordinates = [[1,2],[2,3],[3,5],[5,6],[7,7],[1,2]]
    pr(coordinates)
O/P:
1 1
1 2
2 1
2 1
-6 -5
This is fairly similar to your original code:
coordinates = [[1,2],[2,3],[3,5],[5,6],[7,7],[1,2]]
x_prev = None
for x, y in coordinates:
    if x_prev is not None:
        print('{}, {}'.format(x - x_prev, y - y_prev))
    x_prev, y_prev = x, y
If you want to generalize a bit, for different lengths of coordinates, you could do this:
coordinates = [[1,2],[2,3],[3,5],[5,6],[7,7],[1,2]]
prev = None
for c in coordinates:
    if prev is not None:
        print(', '.join(str(c2 - c1) for c1, c2 in zip(prev, c)))
    prev = c
You need to iterate over the list with the range function so that you can access the current and the next element together, and then do the subtraction inside the loop.
coordinates = [[1,2],[2,3],[3,5],[5,6],[7,7],[1,2]]
for i in range(len(coordinates) - 1):
    print(coordinates[i+1][0] - coordinates[i][0], coordinates[i+1][1] - coordinates[i][1])
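The same pairwise subtraction can also be written with zip over the list and its tail, avoiding explicit indexing (a small sketch of my own, not from the original answers):
coordinates = [[1,2],[2,3],[3,5],[5,6],[7,7],[1,2]]
# zip pairs each coordinate with its successor: ([1,2],[2,3]), ([2,3],[3,5]), ...
for (x0, y0), (x1, y1) in zip(coordinates, coordinates[1:]):
    print(x1 - x0, y1 - y0)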

Python: creating a recursive function / loop for a function whose inputs and outputs are lists

I am trying to simulate a biological gene network in Python by updating the probabilities of the genes for each time step. Then, the results for each time step will be plotted. I can do it by copying and pasting over and over again, but that is not ideal once the number of time steps grows beyond 10. Here is exactly what I have done so far.
Pgt1, Pkr1, Pkni1, Phb1 = test.simUpdateFun(PgtI, PkrI, PkniI, PhbI)
Pgt2, Pkr2, Pkni2, Phb2 = test.simUpdateFun(Pgt1, Pkr1, Pkni1, Phb1)
Pgt3, Pkr3, Pkni3, Phb3 = test.simUpdateFun(Pgt2, Pkr2, Pkni2, Phb2)
Pgt4, Pkr4, Pkni4, Phb4 = test.simUpdateFun(Pgt3, Pkr3, Pkni3, Phb3)
Pgt5, Pkr5, Pkni5, Phb5 = test.simUpdateFun(Pgt4, Pkr4, Pkni4, Phb4)
data.dataPlot(PgtI, PkrI, PkniI, PhbI)
data.dataPlot(Pgt1, Pkr1, Pkni1, Phb1)
data.dataPlot(Pgt2, Pkr2, Pkni2, Phb2)
data.dataPlot(Pgt3, Pkr3, Pkni3, Phb3)
data.dataPlot(Pgt4, Pkr4, Pkni4, Phb4)
data.dataPlot(Pgt5, Pkr5, Pkni5, Phb5)
simUpdateFun is a function I wrote within a class to implement the genes' interactions and update the probabilities. The data structure for each variable is a list containing around 20 data points.
At first I thought about writing a recursive function for the update. Unfortunately, my knowledge of Python is rather limited (self-taught over a summer), and all I know of recursive functions in Python are simple cases such as the factorial function and Fibonacci sequences. The biggest problem for me is that I cannot work out how to write the loop, or the recursive function, given that the input and output of simUpdateFun are lists.
The simUpdateFun is as follow:
def simUpdateFun(self, PgtPre, PkrPre, PkniPre, PhbPre):
    """A function to update the gap gene probabilities based on the mutual
    repression relationship from the previous ones."""
    PgtS = []
    PkrS = []
    PkniS = []
    PhbS = []
    # implement the mutual strong repression interaction
    for i in range(self.xlen):
        PgtS.append(1 - PkrPre[i])
        PkrS.append(1 - PgtPre[i])
        PkniS.append(1 - PhbPre[i])
        PhbS.append(1 - PkniPre[i])
    # implementation of the weak overlapping repression interaction
    PgtR = []
    PkrR = []
    PkniR = []
    PhbR = []
    x = len(PgtPre)
    P1 = 0
    P2 = int(x / 4)
    P3 = int(x * 2 / 4)
    P4 = int(x * 3 / 4)
    P5 = int(x)
    # first calculate the repressor probability function for the gap genes
    for i in PgtPre:
        PgtR.append(self.repressor(i, 1))
    for i in PkrPre:
        PkrR.append(self.repressor(i, 1))
    for i in PkniPre:
        PkniR.append(self.repressor(i, 1))
    for i in PhbPre:
        PhbR.append(self.repressor(i, 1))
    # implement the interactions of weak repression of overlapping genes.
    # Copy new list objects to avoid altering the initial condition values.
    PgtW = list(i for i in PgtPre)
    PkrW = list(i for i in PkrPre)
    PkniW = list(i for i in PkniPre)
    PhbW = list(i for i in PhbPre)
    for i in range(P4, P5):  # quadrant 4 regulation
        PgtW[i] = PgtPre[i] * PhbR[i]
    for i in range(P3, P4):  # quadrant 3 regulation
        PkniW[i] = PkniPre[i] * PgtR[i]
    for i in range(P2, P3):  # quadrant 2 regulation
        PkrW[i] = PkrPre[i] * PkniR[i]
    for i in range(P1, P2):  # quadrant 1 regulation
        PhbW[i] = PhbPre[i] * PkrR[i]
        PkrW[i] = PkrPre[i] * PhbR[i]
    # determine the final probabilities by multiplying the two effects,
    # since they take place simultaneously
    Pgtf = []
    Pkrf = []
    Pknif = []
    Phbf = []
    for i in range(len(PgtPre)):
        Pgtf.append(PgtS[i] * PgtW[i])
        Pkrf.append(PkrS[i] * PkrW[i])
        Pknif.append(PkniS[i] * PkniW[i])
        Phbf.append(PhbS[i] * PhbW[i])
    return (Pgtf, Pkrf, Pknif, Phbf)
The function basically takes in a set of lists containing probability values and outputs the updated versions of those lists.
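No answer made it into this page, but the copy-and-paste repetition can be replaced by an ordinary loop that carries the four lists as one state tuple. This is a minimal sketch of mine, assuming test.simUpdateFun and data.dataPlot behave as in the snippets above:
n_steps = 5
state = (PgtI, PkrI, PkniI, PhbI)  # initial probabilities from the question
history = [state]
for _ in range(n_steps):
    state = test.simUpdateFun(*state)  # unpack the tuple into the four arguments
    history.append(state)

for Pgt, Pkr, Pkni, Phb in history:
    data.dataPlot(Pgt, Pkr, Pkni, Phb)
Increasing n_steps is then the only change needed for longer simulations; recursion is not required.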

Loop and extract window

I am trying to create a function (or series of functions), that perform the following operations:
Given an input array (A), for each cell A[i,j], extract a window (W) of custom size, whose minimum value will be:
min = np.min(W)
The output matrix (H) will store the values as:
H[i,j] = A[i,j] - min(W)
For an easier understanding of the issue, I attached a picture (Example):
My current code is this:
import numpy as np
import scipy.ndimage as sc

def res_array(matrix, size):
    result = []
    sc.generic_filter(matrix, nothing, size, extra_arguments=(result,), mode='nearest')
    mat_out = result
    return mat_out

def local(window):
    h = np.empty_like(window)
    w = res_array(window, 3)
    win_min = np.apply_along_axis(min, 1, w)
    # This is where I think it's broken
    for k in win_min:
        for i in range(window.shape[0]):
            for j in range(window.shape[1]):
                h[i, j] = window[i, j] - k
        k += 1
    return h

def nothing(window, out):
    values = []
    for i in range(window.shape[0]):
        values.append(window[i])
    out.append(values)
    return 0

test = np.ones((10, 10)) * np.arange(10)
a = local(test)
I need the code to move on to the next value of 'for k in win_min' for each cell of the input matrix A (here, test).
Edit: I thought of something like directly accessing the index of win_min and incrementing it by one, like I saw here: Increment the value inside a list element, but I don't know how to do that.
Thanks for any help!
import numpy as np

N = 4                           # matrix size
a = np.random.random((N, N))    # input
# --- window size
wl = 1  # left
wr = 1  # right
wt = 1  # top
wb = 1  # bottom
# ---
H = np.zeros((N, N))            # output

def h(k, l):                    # individual cell function
    # --- checks so we do not run out of the array
    k1 = max(k - wt, 0)
    k2 = min(k + wb + 1, N)
    l1 = max(l - wl, 0)
    l2 = min(l + wr + 1, N)
    # ---
    return a[k, l] - np.amin(a[k1:k2, l1:l2])

H = np.array([[h(k, l) for l in range(N)] for k in range(N)])  # running over all matrix elements
print(a)
print(H)
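As a side note (my addition, not from the original answers), scipy.ndimage already ships a sliding-window minimum, so for a fixed window size the whole H can likely be computed without an explicit Python loop; a short sketch assuming a 3x3 window:
import numpy as np
from scipy.ndimage import minimum_filter

A = np.ones((10, 10)) * np.arange(10)   # same test array as in the question
# H[i, j] = A[i, j] - min of the 3x3 window centred on (i, j)
H = A - minimum_filter(A, size=3, mode='nearest')
print(H)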

Query long lists

I would like to query the value of an exponentially weighted moving average at particular points. An inefficient way to do this is as follows. l is the list of times of events and queries has the times at which I want the value of this average.
a = 0.01
l = [3, 7, 10, 20, 200]
y = [0] * 1000
for item in l:
    y[int(item)] = 1
s = [0] * 1000
for i in xrange(1, 1000):
    s[i] = a * y[i-1] + (1-a) * s[i-1]
queries = [23, 68, 103]
for q in queries:
    print s[q]
Outputs:
0.0355271185019
0.0226018371526
0.0158992102478
In practice l will be very large and the range of values in l will also be huge. How can I find the values at the times in queries more efficiently, and especially without computing the potentially huge lists y and s explicitly? I need it to be in pure Python so I can use PyPy.
Is it possible to solve the problem in time proportional to len(l)
and not max(l) (assuming len(queries) < len(l))?
Here is my code for doing this:
def ewma(l, queries, a=0.01):
    def decay(t0, x, t1, a):
        from math import pow
        return pow((1-a), (t1-t0)) * x

    assert l == sorted(l)
    assert queries == sorted(queries)
    samples = []
    try:
        t0, x0 = (0.0, 0.0)
        it = iter(queries)
        q = it.next() - 1.0
        for t1 in l:
            # new value is decayed previous value, plus a
            x1 = decay(t0, x0, t1, a) + a
            # take care of all queries between t0 and t1
            while q < t1:
                samples.append(decay(t0, x0, q, a))
                q = it.next() - 1.0
            # take care of all queries equal to t1
            while q == t1:
                samples.append(x1)
                q = it.next() - 1.0
            # update t0, x0
            t0, x0 = t1, x1
        # take care of any remaining queries
        while True:
            samples.append(decay(t0, x0, q, a))
            q = it.next() - 1.0
    except StopIteration:
        return samples
I've also uploaded a fuller version of this code with unit tests and some comments to pastebin: http://pastebin.com/shhaz710
EDIT: Note that this does the same thing as what Chris Pak suggests in his answer, which he must have posted as I was typing this. I haven't gone through the details of his code, but I think mine is a bit more general. This code supports non-integer values in l and queries. It also works for any kind of iterables, not just lists since I don't do any indexing.
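A quick sanity check (my addition, not part of the original answer): calling ewma() with the example data from the question should reproduce, up to floating-point detail, the three values printed by the brute-force loop above.
l = [3, 7, 10, 20, 200]
queries = [23, 68, 103]
for value in ewma(l, queries, a=0.01):
    print value   # expected to match the brute-force outputs above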
I think you could do it in log(len(l)) time per query, if l is sorted. The basic idea is that the non-recursive form of the EMA is
s_i = a*y_(i-1) + a*(1-a)*y_(i-2) + a*(1-a)^2*y_(i-3) + ...
This means that for query k you find the greatest event time in l that is less than k and, up to an estimation limit, accumulate terms of the form
(1-a)^(k - l[v]) * value(l[v]) + ...
where v indexes l and l[v] is the event time.
Then you spend lg(len(l)) time on the search plus a constant multiple of the depth of your estimation. I'll provide a code sample in a little bit (after work) if you want it; I just wanted to get the idea out there while I was thinking about it.
Here's the code. v is the dictionary of values at a given time; replace it with 1 if the value is just 1 every time...
import math
from bisect import bisect_right

a = .01
limit = 1000
l = [1,5,14,29...]

def find_nearest_lt(l, time):
    i = bisect_right(l, time)
    if i:
        return i - 1
    raise ValueError

def find_ema(l, time):
    i = find_nearest_lt(l, time)
    if l[i] == time:
        result = a * v[l[i]]
        i -= 1
    else:
        result = 0
    while (time - l[i]) < limit:
        result += math.pow(1-a, time - l[i]) * v[l[i]]
        i -= 1
    return result
If I'm thinking correctly, the find-nearest step is lg(n), and the while loop is guaranteed to run at most 1000 iterations, so it's technically a constant (though a fairly large one). find_nearest was adapted from the bisect documentation page - http://docs.python.org/2/library/bisect.html
It appears that y is a binary value -- either 0 or 1 -- depending on the values of l. Why not use y = set(int(item) for item in l)? That's the most efficient way to store and look up a list of numbers.
Your code will cause an error the first time through this loop:
s = [0]*1000
for i in xrange(1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]
because i-1 is -1 when i=0 (first pass of loop) and both y[-1] and s[-1] are the last element of the list, not the previous. Maybe you want xrange(1,1000)?
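A two-line illustration of that gotcha (my addition, not part of the original answer):
y = [10, 20, 30]
print y[-1]   # 30: a negative index wraps around to the end of the list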
How about this code:
a = 0.01
l = [3.0, 7.0, 10.0, 20.0, 200.0]
y = set(int(item) for item in l)
queries = [23, 68, 103]
ewma = []
x = 1 if (0 in y) else 0
for i in xrange(1, queries[-1] + 1):
    x = (1-a)*x
    if i in y:
        x += a
    if i == queries[0]:
        ewma.append(x)
        queries.pop(0)
Edited to include SchighSchagh's improvements.
