Pandas/NumPy: Calculate current state series based on binary signals

I have two time series of binary "signals"; let's call them "entry" and "stay".
entry==1 means add 1 to the current state (for some maximum amount of time), and stay==0 means reset the current state to 0.
entry:
0
1
1
0
1
0
stay:
1
1
1
1
0
1
My code now calculates a combined current state:
state:
0
1
2
2
0
1
Currently I use the following code; unfortunately it's quite slow, depending on max_time (state, stay, and entry are Pandas time series):
import copy
import pandas as pd

state = copy.deepcopy(entry)
state[stay == 0] = 0
# first iteration
state[(entry.shift(1) == 1) & (stay == 1)] += 1
# 2nd iteration up to max_time
# (stay.rolling(lag).mean() is the modern spelling of pd.rolling_mean(stay, lag))
for lag in range(2, max_time + 1):
    state[(entry.shift(lag) == 1) & (stay.rolling(lag).mean() == 1)] += 1
Any idea how to vectorize this code for better performance? Many thanks!

I finally found a solution, using some NumPy functions:
import copy
import numpy as np
import pandas as pd

def calc_state_series(entry, stay, max_time=5):
    reduce = (copy.deepcopy(entry) * 0).fillna(0)  # zero series, just for initialization
    # mark positions where an entry from max_time steps ago expires
    reduce[(entry.shift(max_time) == 1) & (stay.rolling(max_time).mean() == 1)] -= 1
    # subtract expiring entries so the state drops again after max_time
    entry = (entry + reduce.shift(1)).fillna(0)
    # prepend a sentinel so position 0 can act as a reset point
    x = np.concatenate(([0], entry.values))
    y = np.concatenate(([0], stay.values))
    resets = y == 0
    x[resets] = 0
    # for every position, find the index of the most recent reset
    reset_idx = np.zeros(len(x), dtype=int)
    reset_idx[resets] = np.arange(len(x))[resets]
    reset_idx = np.maximum.accumulate(reset_idx)
    # cumulative sum of entries, restarted at each reset
    cumsum = np.cumsum(x)
    cumsum = cumsum - cumsum[reset_idx]
    return pd.Series(cumsum[1:], index=entry.index)
I managed to avoid the loop, and this solution is up to 100x faster for me (depending on max_time), but there is probably still potential for further optimization.
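For reference, here is a minimal usage sketch on the example series from the top of the question (my addition; it assumes a plain integer index, whereas the original works on time series):

entry = pd.Series([0, 1, 1, 0, 1, 0])
stay = pd.Series([1, 1, 1, 1, 0, 1])
print(calc_state_series(entry, stay, max_time=5).tolist())

The cumulative-sum-minus-last-reset trick is what replaces the per-lag loop, so the Python-level work no longer grows with max_time.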

Related

Return a pandas series from a loop

I have a pandas dataframe nike that looks like this:
rise1 run1 position
1 0.82 1
3 1.64 2
5 3.09 3
7 5.15 4
8 7.98 5
15 11.12 6
I am trying to make a function that calculates grade (rise/run) and returns it as a pandas series. I want to calculate the grade using the point X rows ahead of the current position minus the point X rows behind it (i.e. if X = 2, the grade at position 4 is (15-3)/(11.12-1.64)).
import numpy as np
import pandas as pd

def get_grade(dataf, X=2):
    grade = pd.Series(data=np.nan, index=range(dataf.shape[0]))
    for i in range(X, dataf.shape[0] - X):
        rise = dataf.loc[i + X, 'rise1'] - dataf.loc[i - X, 'rise1']
        run = dataf.loc[i + X, 'run1'] - dataf.loc[i - X, 'run1']
        if np.isclose(rise, 0) or np.isclose(run, 0):
            grade[i] = 0
        elif rise / run > 1:
            grade[i] = 1
        elif rise / run < -1:
            grade[i] = -1
        else:
            grade[i] = rise / run
    return grade

get_grade(nike, X=2)
When I call the function, nothing happens: the code executes, but nothing appears. What might I be doing wrong? Apologies if this is unclear; I am very new to coding in general, with limited vocabulary in this area.
You have to assign the function's return value to a variable (i.e. set the variable equal to your return value) and then print/display that variable, like df = get_grade(nike, X=2) followed by print(df). Or put a print call inside your function:
def test_function():
    df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [4, 3, 2, 1]})
    return df

df = test_function()
print(df)
Or
def test_print_function():
    df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [4, 3, 2, 1]})
    print(df)

test_print_function()
The way you are working is suboptimal. In general, a for loop plus repeated .loc calls in pandas is a signal that you're not taking advantage of the framework.
My suggestion is to use a rolling window, and apply your calculations:
WINDOW = 2
rolled = (df[['rise1', 'run1']]   # df here is your nike dataframe
          .rolling(2 * WINDOW + 1, center=True)
          .apply(lambda s: s.iloc[0] - s.iloc[-1]))
print(rolled['rise1'] / rolled['run1'])
0 NaN
1 NaN
2 0.977654
3 1.265823
4 NaN
5 NaN
dtype: float64
As to your specific problem, I cannot reproduce it: copying and pasting your code into a brand-new notebook works fine, although apparently it doesn't yield the results you want (i.e. you don't get (15-3)/(11.12-1.64) as you intended).
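If you also need the capping behaviour from the original function, one way to bolt it onto the rolling result is sketched below (my addition, not part of the original answer; rolled is the frame computed above):

import numpy as np

grade = (rolled['rise1'] / rolled['run1']).clip(lower=-1, upper=1)  # clamp to [-1, 1]
# zero out near-degenerate windows, mirroring the np.isclose checks in the loop version
degenerate = np.isclose(rolled['rise1'], 0) | np.isclose(rolled['run1'], 0)
grade[degenerate] = 0
print(grade)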

Chunk a variable into parts and sum the total in each part

My dataset has 2 million observations. I want to split it into 200 categories based on the value of a variable, 'rv'. For example, imagine I had the categories 0-1000, 1000-2000, 2000-3000, 3000-4000, 4000-5000. I would want to split an observation with value 4500 like this: 1000 in each of the first four categories, and 500 in the final category. I have the following code, which works but is very slow:
# create random data set
import pandas as pd
import numpy as np

data = np.random.randint(0, 5000, size=2000)
df = pd.DataFrame({'rv': data})

#%% slice
sizes = [0, 1000, 2000, 3000, 4000, 5000]
size_names = ['{:.0f} to {:.0f}'.format(lower, upper)
              for lower, upper in zip(sizes[:-1], sizes[1:])]
for lower, upper, name in zip(sizes[:-1], sizes[1:], size_names):
    df[name] = df['rv'].apply(lambda x: max(0, min(x, upper) - lower))

# summary table
df_slice = df[size_names].sum()
Are there better ways of doing this, where better principally means faster? With 2 million observations and 200 categories this takes quite a long time (I'm not sure how long, as I stopped the code before it finished).
I wrote an algorithm that sorts the data beforehand, which takes it from an O(n*m) loop (over the data and the categories) to an O(n) loop (just over the data, albeit with an O(n log n) sort first). Because the data is sorted, you always know which bin you're in, so you only have to accumulate the sum for the current bin; each time the bin changes, you apply that sum to the bin and add the full bin width, times the count, to every bin below it. It takes about 1.2 seconds on 2 million data points over 200 categories. Hope it helps:
from random import randint

data = [randint(0, 4999) for i in range(2000000)]
sizes = range(0, 5001, 25)
bound_pairs = [[sizes[i], sizes[i + 1]] for i in range(len(sizes) - 1)]
results = [0 for i in range(len(sizes) - 1)]

data.sort()
curr_bin = 0
curr_bin_count = 0
curr_bin_sum = 0
for d in data:
    if d >= bound_pairs[curr_bin][1]:
        # flush the finished bin: partial amounts stay here,
        # full bin widths go to every bin below
        results[curr_bin] += curr_bin_sum
        for i in range(curr_bin):
            results[i] += curr_bin_count * (bound_pairs[i][1] - bound_pairs[i][0])
        curr_bin_count = 0
        curr_bin_sum = 0
        while d >= bound_pairs[curr_bin][1]:
            curr_bin += 1
    curr_bin_count += 1
    curr_bin_sum += d - bound_pairs[curr_bin][0]
# flush the final bin
results[curr_bin] += curr_bin_sum
for i in range(curr_bin):
    results[i] += curr_bin_count * (bound_pairs[i][1] - bound_pairs[i][0])
EDIT: There may be some issues here depending on whether you want the upper bound or lower bound to be inclusive or exclusive. I leave the particulars to you.
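The same idea can be pushed all the way into NumPy: sort once, then use searchsorted plus a cumulative sum so each bin's total comes from a handful of array lookups instead of a Python loop. A sketch of this (my own variant, using the same max(0, min(x, upper) - lower) convention as the question's code):

import numpy as np

data = np.sort(np.random.randint(0, 5000, size=2_000_000))
sizes = np.arange(0, 5001, 25, dtype=np.int64)
lower, upper = sizes[:-1], sizes[1:]

csum = np.concatenate(([0], np.cumsum(data, dtype=np.int64)))  # csum[k] = sum of k smallest values
i_lo = np.searchsorted(data, lower, side='right')  # values <= lower contribute nothing
i_hi = np.searchsorted(data, upper, side='left')   # values >= upper saturate the bin
inside = (csum[i_hi] - csum[i_lo]) - lower * (i_hi - i_lo)  # partial contributions x - lower
above = (len(data) - i_hi) * (upper - lower)                # full bin width per saturating value
results = inside + above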

Obtain delays array from timestamp array

I want to calculate the variance of the delays between signal arrivals. Each time a signal comes in, a timestamp is registered in the 'time' field of the Logs table of my SQLite database. So I solve the problem the following way:
import numpy

cursor.execute('SELECT time FROM Logs')
rows = cursor.fetchall()
x = numpy.array(rows[:-1])
y = numpy.array(rows[1:])
z = y - x
print("Var =", z.var())
That gives me the correct value. But... the solution uses two numpy arrays (z stores the delay between each signal and the previous one; to be precise, len(z) == len(rows) - 1). I wonder if there is an elegant "numpy" way to do this with only one array, and without iterating over all rows.
I think you're looking for the np.diff function.
import numpy as np
# example data
rows = np.r_[:10]
z = rows[1:] - rows[:-1]
print(z)
#[1 1 1 1 1 1 1 1 1]
z = np.diff(rows)
print(z)
#[1 1 1 1 1 1 1 1 1]
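Applied to the question's data, the whole computation then collapses to a single diff (assuming rows holds numeric timestamps; fetchall() returns 1-tuples, hence the ravel()):

times = np.array(rows).ravel()  # flatten the list of 1-tuples from fetchall()
print("Var =", np.diff(times).var())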

Storing all values when creating a Pandas Pivot Table

Basically, I'm aggregating prices over three indices to determine the mean and std, as well as an upper/lower limit. So far so good. However, now I also want to find the lowest identified price which is still >= the computed lower limit.
My first idea was to use np.min to find the lowest price, but this obviously disregards the lower limit and is not useful. Now I'm trying to store all the values the pivot table identified, in order to find the lowest price which is still >= the lower limit. Any ideas?
pivot = pd.pivot_table(temp, index=['A', 'B', 'C'], values=['price'],
                       aggfunc=[np.mean, np.std], fill_value=0)
pivot['lower_limit'] = pivot['mean'] - 2 * pivot['std']
pivot['upper_limit'] = pivot['mean'] + 2 * pivot['std']
First, merge pivoted['lower_limit'] back into temp. Thus, for each price in temp there is also a lower_limit value.
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
Then you can restrict your attention to those rows in temp for which the price is >= lower_limit:
temp.loc[temp['price'] >= temp['lower_limit']]
The desired result can be found by computing a groupby/min:
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
For example,
import numpy as np
import pandas as pd
np.random.seed(2017)
N = 1000
ABC = list('ABC')
temp = pd.DataFrame(np.random.randint(2, size=(N,3)), columns=ABC)
temp['price'] = np.random.random(N)
pivoted = pd.pivot_table(temp, index=['A', 'B', 'C'], values=['price'],
                         aggfunc=[np.mean, np.std], fill_value=0)
pivoted['lower_limit'] = pivoted['mean'] - 2 * pivoted['std']
pivoted['upper_limit'] = pivoted['mean'] + 2 * pivoted['std']
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
print(result)
yields
A  B  C
0  0  0    0.003628
      1    0.000132
   1  0    0.005833
      1    0.000159
1  0  0    0.006203
      1    0.000536
   1  0    0.001745
      1    0.025713
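For what it's worth, the merge step can be avoided entirely with groupby/transform, which broadcasts each group's limit back onto the original rows. A sketch of that alternative (my addition; note that Series.std uses ddof=1, so the limits can differ slightly from an np.std-based pivot):

limits = temp.groupby(ABC)['price'].transform(lambda s: s.mean() - 2 * s.std())
result = temp.loc[temp['price'] >= limits].groupby(ABC)['price'].min()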

Issues with using np.linalg.solve in Python

Below, I'm trying to code a Crank-Nicolson numerical solution to the Navier-Stokes equation for momentum (simplified with placeholders for the time being), but I'm having issues with solving for umat[timecount,:], and keep getting the error "ValueError: setting an array element with a sequence". I'm extremely new to Python; does anyone know what I could do differently to avoid this problem?
Thanks!!
def step(timesteps, dn, dt, Numvpts, Cd, g, alpha, Sl, gamma, theta_L, umat):
    for timecount in range(0, timesteps + 1):
        if timecount == 0:
            umat[timecount, :] = 0
        else:
            Km = 1  # placeholder for eddy viscosity
            thetaM = 278.15  # placeholder for theta_m for the time being
            A = Km * dt / (2 * (dn ** 2))
            B = (-g * dt / theta_L) * thetaM * np.sin(alpha)
            C = -dt * (1 / (2 * Sl) + Cd)
            W_arr = np.zeros(Numvpts + 1)
            D = np.zeros(Numvpts + 1)
            for x in (0, Numvpts):  # creating the vertical velocity term
                if x == 0:
                    W_arr[x] = 0
                    D[x] = 0
                else:
                    W_arr[x] = W_arr[x - 1] - (dn / Sl) * umat[timecount - 1, x - 1]
            D = W_arr / (4 * dn)
            coef_mat_u = Neumann_mat(Numvpts, D - A, (1 + 2 * A), -(A + D))
            b_arr_u = np.zeros(Numvpts + 1)  # the array of known quantities
            umat_forward = umat[timecount - 1, 2:Numvpts]
            umat_center = umat[timecount - 1, 1:Numvpts - 1]
            umat_backward = umat[timecount - 1, 0:Numvpts - 2]
            for j in (0, Numvpts):
                if j == 0:
                    b_arr_u[j] = 0
                elif j == Numvpts:
                    b_arr_u[j] = 0
                else:
                    b_arr_u[j] = (A + D[j]) * umat_backward[j] * (1 - 2 * A) * umat_center[j] \
                                 + (A - D[j]) * umat_forward[j] - C * (umat_center[j] * umat_center[j]) - B
            umat[timecount, :] = np.linalg.solve(coef_mat_u, b_arr_u)
    return umat
Please note that

for i in (0, 20):
    print(i)

will print 0 20, not 0 1 2 3 4 ... 20, because it iterates over the two elements of the tuple (0, 20). So you have to use the range() function:

for i in range(0, 20 + 1):
    print(i)

to get 0 1 2 3 4 ... 20.
I have not gone through your code rigorously, but I think the problem is in your two inner for loops:

for x in (0, Numvpts):  # creating the vertical velocity term

This sets values only at index 0 and index Numvpts. I think you must use

for x in range(0, Numvpts):

The same applies here (range() must be used):

for j in (0, Numvpts):

Also, with range() j never becomes == Numvpts, yet you check that condition; I guess it must be == Numvpts - 1. And isn't the else branch supposed to run for every index other than 0? As your code stands, the right-hand-side vector has the same numbers from index 1 onwards!
I think the fundamental problem is that you are not using range(). It is also a good idea to solve the NS equations on a small grid and manually check the A matrix and b vector to see whether they are being set correctly.
