Return a pandas series from a loop - python

I have a pandas dataframe nike that looks like this:
rise1  run1   position
1      0.82   1
3      1.64   2
5      3.09   3
7      5.15   4
8      7.98   5
15     11.12  6
I am trying to make a function that calculates grade (rise/run) and returns it as a pandas series. I want to use X points ahead of the current position minus X points behind the current position to calculate grade (i.e. if X = 2, the grade at position 4 is (15-3)/(11.12-1.64)).
def get_grade(dataf, X=n):
    grade = pd.Series(data=None, index=range(dataf.shape[0]))
    for i in range(X, dataf.shape[0] - X):
        rise = dataf.loc[i + X, 'rise1'] - dataf.loc[i - X, 'rise1']
        run = dataf.loc[i + X, 'run1'] - dataf.loc[i - X, 'run1']
        if np.isclose(rise, 0) or np.isclose(run, 0):
            grade[i] = 0
        elif rise / run > 1:
            grade[i] = 1
        elif rise / run < -1:
            grade[i] = -1
        else:
            grade[i] = rise / run
    return grade

get_grade(nike, X=2)
When I call the function, nothing happens. The code executes, but nothing appears. What might I be doing wrong? Apologies if this is unclear; I am very new to coding in general, with limited vocabulary in this area.

You have to assign the function's return value to a variable and then print/display that variable, like df = get_grade(nike, X=2) followed by print(df). Or put a print call inside your function:
def test_function():
    df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [4, 3, 2, 1]})
    return df

df = test_function()
print(df)
Or
def test_print_function():
    df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [4, 3, 2, 1]})
    print(df)

test_print_function()

The way you are working is suboptimal. In general, a for loop plus repeated .loc calls in pandas is a signal that you're not taking advantage of the framework.
My suggestion is to use a rolling window and apply your calculations:
WINDOW = 2
rolled = df[['rise1', 'run1']].rolling(2 * WINDOW + 1, center=True) \
    .apply(lambda s: s.iloc[0] - s.iloc[-1])
print(rolled['rise1'] / rolled['run1'])
0         NaN
1         NaN
2    0.977654
3    1.265823
4         NaN
5         NaN
dtype: float64
Now, as to your specific problem: I cannot reproduce it. Copying and pasting your code into a brand new notebook works fine, but it apparently doesn't yield the results you want; because your function clamps the ratio to [-1, 1], you won't get (15-3)/(11.12-1.64) ≈ 1.27 as you intended.
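If you also want that clamping behaviour without the loop, a minimal vectorized sketch (my own suggestion, assuming the same nike dataframe with a default integer index) could use shift() and clip():

import numpy as np
import pandas as pd

def get_grade_vectorized(dataf, X=2):
    # difference between the row X ahead and the row X behind
    rise = dataf['rise1'].shift(-X) - dataf['rise1'].shift(X)
    run = dataf['run1'].shift(-X) - dataf['run1'].shift(X)
    grade = (rise / run).clip(-1, 1)  # clamp to [-1, 1] like the loop version
    grade[np.isclose(rise, 0) | np.isclose(run, 0)] = 0  # guard near-zero rise/run
    return grade

print(get_grade_vectorized(nike, X=2))

Rows within X of either edge come out as NaN rather than the unset values of the loop version.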

Related

Add a column to a pandas.dataframe that holds the index of the closest point with a certain condition

I have a huge number of points stored with x and y coordinates and an additional value ('value_P') in a pandas DataFrame, so the dataframe looks like:
   x-coordinate  y-coordinate  value_P
0             0             3        1
1            40            58        1
2             5             4        2
3            76            98        2
4            15            35        3
5             5             4        3
but with around 250,000 entries, so I am looking for an efficient solution. I am trying to add a column that holds the row index of the closest other point. Only the distance from points with value_P != 1 to points with value_P == 1 should be considered, and I am only interested in the index for points where value_P != 1. It's difficult to explain, but the desired output should be:
   x-coordinate  y-coordinate  value_P  index
0             0             3        1    NaN
1            40            58        1    NaN
2             5             4        2      0
3            76            98        2      1
4            15            35        3      1
5             5             4        3      0
For rows 0 and 1 the index is NaN because I am not interested in them, since value_P == 1. For row 2 it's 0, because the point from row 0 is the closest point with a value_P of 1.
I hope it's understandable.
I found a solution that involves two DataFrame.apply(lambda x: ...) calls, but it takes a long time. Even if you don't have a concrete solution, an idea of how to improve the performance would be highly appreciated.
My current code is (P_sort is the data and 'zuord' is the added column):
def index2(x_1, y_1, x_2, y_2, last_1):
    h = math.sqrt((x_1 - x_2) ** 2 + (y_1 - y_2) ** 2)
    return h

def index(x_1, y_1, x_v, y_v, last_1):
    df2 = pnd.DataFrame()
    df3 = pnd.DataFrame()
    df2['x-coordinate'] = x_v
    df2['y-coordinate'] = y_v
    df3['distances'] = df2.apply(
        lambda x: index2(x['x-coordinate'], x['y-coordinate'], x_1, y_1, last_1), axis=1)
    k = df3.idxmin()
    print(k)
    return k

last_1 = np.count_nonzero(P_sort[:, 2] == 1) - 1
df = pnd.DataFrame(P_sort,
                   columns=['x-coordinate', 'y-coordinate', 'value_P'])
number_columnx = df.loc[:, 'x-coordinate']
number_columny = df.loc[:, 'y-coordinate']
x_v = number_columnx.values
y_v = number_columny.values
x_v = x_v[0:last_1]
y_v = y_v[0:last_1]
df['zuord'] = df.apply(lambda x: index(x['x-coordinate'], x['y-coordinate'], x_v, y_v, last_1), axis=1)
I am new to programming, so the code is kind of ugly.
I benchmarked four solutions, and the fastest approach is a KD Tree.
Test Dataset
I randomly generated dataframes of various sizes to test the performance of each method.
import numpy as np
import pandas as pd

def generate_spots(n, p=0.005):
    x_pos = np.random.uniform(0, 100, n)
    y_pos = np.random.uniform(0, 100, n)
    value_P = np.random.binomial(size=n, n=1, p=(1 - p)) + 1
    df = pd.DataFrame({
        'x-coordinate': x_pos,
        'y-coordinate': y_pos,
        'value_P': value_P
    })
    df = df.sort_values('value_P').reset_index(drop=True)
    return df
This generates a dataframe with n rows, where each row has probability p of being class 1. I also sorted it, because the original method seems to assume that the dataframe is sorted by value_P.
Method 1: Original
I made some small changes to your code to get it to work for me:
def method1(df):
    df = df.copy()
    last_1 = np.count_nonzero(df.loc[:, 'value_P'] == 1)
    number_columnx = df.loc[:, 'x-coordinate']
    number_columny = df.loc[:, 'y-coordinate']
    x_v = number_columnx.values
    y_v = number_columny.values
    x_v = x_v[0:last_1]
    y_v = y_v[0:last_1]
    df['index'] = df.apply(lambda x: index(x['x-coordinate'], x['y-coordinate'], x_v, y_v, last_1), axis=1)
    df.loc[0:last_1 - 1, 'index'] = -1
    return df
index() and index2() are defined the same way as in your question. I also use -1 as a placeholder instead of NaN; no deep reason for this, just personal preference.
Method 2: cdist
SciPy has a function called cdist() which computes the distance between every pair of points across two arrays of points.
import scipy.spatial.distance

def method2(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    nearest_point = scipy.spatial.distance.cdist(source_df, target_df).argmin(axis=1)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point
    return df
The cdist function is pretty much the same as what you're doing - it's just implemented in C rather than Python.
Method 3: KD Tree
A KD tree is a data structure designed to efficiently search for nearby points. You can use scikit-learn to implement this.
import sklearn.neighbors

def method3(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    tree = sklearn.neighbors.KDTree(target_df)
    nearest_point = tree.query(source_df, k=1, return_distance=False)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point.flatten()
    return df
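If you'd rather avoid the scikit-learn dependency, scipy.spatial.cKDTree should behave comparably; this variant was not part of the benchmark below, so treat it as an untested sketch:

import scipy.spatial

def method3_scipy(df):
    # same structure as method3, but with SciPy's cKDTree
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    tree = scipy.spatial.cKDTree(target_df.to_numpy())
    _, nearest_point = tree.query(source_df.to_numpy(), k=1)  # query returns (distances, indices)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point
    return df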
Method 4: fastdist
The Python package fastdist bills itself as a faster alternative to scipy's distance calculation methods. Ironically, I found this solution to be slower than cdist at all problem sizes.
from fastdist import fastdist

def method4(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    target_array = target_df.to_numpy()
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    source_array = source_df.to_numpy()
    nearest_point = fastdist.matrix_to_matrix_distance(source_array, target_array, fastdist.euclidean, "euclidean").argmin(axis=1)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point
    return df
Benchmarks
Each method was run ten times, with various sizes of dataframe, in random order. Here are the results of the benchmark. Note that both the X and Y axes are log-scale.
I didn't benchmark fastdist or the original method for more than 30,000 points, because it took too long.
The fastest methods, in this benchmark, are the cdist method, for fewer than 1000 points, and KD Tree method, for more than 1000 points. At 250K points, the fastest solution is the KD Tree, taking only 0.2 seconds.
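The exact timing code isn't shown above, so as an assumption, a minimal sketch of the kind of harness used (relying on generate_spots and the methodN functions defined earlier) might look like:

import time

def time_method(method, n, repeats=10):
    # best-of-N wall-clock timing for one method at one problem size
    df = generate_spots(n)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        method(df)
        times.append(time.perf_counter() - start)
    return min(times)

for n in (1_000, 10_000, 100_000):
    for m in (method2, method3):  # method1/method4 get too slow past ~30,000 points
        print(n, m.__name__, f"{time_method(m, n):.4f}s")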

Using previous row value by looping through index conditioning

If I have a dataframe with column x,
I want to make a new column x_new, but I want the first row of this new column to be set to a specific number (let's say -2).
Then, from the 2nd row on, use the previous row's value in the cx function:
data = {'x': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

def cx(x):
    if df.loc[1, 'x_new'] == 0:
        df.loc[1, 'x_new'] = -2
    else:
        x_new = -10 * x + 2
        return x_new

df['x_new'] = cx(df['x'])
The final dataframe
I am not sure how to do this. Thank you for your help.
This is what I have so far:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
df

# calculate equation
def depth_cal(d):
    z = -3 * d + 1  # d must be previous row
    return z

depth_cal = depth_cal(df['depth'])  # how to set d as previous row
print(depth_cal)

depth_new = []
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal)  # does not put list in a column

df['Depth_correct'] = depth_new
correct output:
There are still two problems with this:
1. It does not put the depth_cal list properly in a column.
2. In the depth_cal function, I want d to be the previous row.
Thank you
I would do this by just using a loop to generate your new data. It might not be ideal if the data is particularly huge, but it's a quick operation. Let me know how you get on with this:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

res = list(data['depth'])  # copy, so the source dict isn't mutated
res[0] = -5.63
for i in range(1, len(res)):
    res[i] = -3 * res[i - 1] + 1

df['new_depth'] = res
print(df)
To get
   depth  new_depth
0      1      -5.63
1      2      17.89
2      3     -52.67
3      4     159.01
4      5    -476.03
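If you prefer to avoid the explicit index loop, an equivalent sketch (my own suggestion, not from the answer above) can express the recurrence with itertools.accumulate; note this needs Python 3.8+ for the initial keyword:

from itertools import accumulate
import pandas as pd

df = pd.DataFrame({'depth': [1, 2, 3, 4, 5]})
# seed with -5.63, then apply x[i] = -3 * x[i-1] + 1 for each later row
df['new_depth'] = list(accumulate(range(len(df) - 1),
                                  lambda prev, _: -3 * prev + 1,
                                  initial=-5.63))
print(df)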

Storing all values when creating a Pandas Pivot Table

Basically, I'm aggregating prices over three indices to determine mean and std, as well as an upper/lower limit. So far so good. However, now I also want to find the lowest identified price which is still >= the computed lower limit.
My first idea was to use np.min to find the lowest price, but this obviously disregards the lower limit and is not useful. Now I'm trying to store all the values the pivot table identified, to find the lowest price which still is >= the lower limit. Any ideas?
pivot = pd.pivot_table(temp, index=['A', 'B', 'C'], values=['price'],
                       aggfunc=[np.mean, np.std], fill_value=0)
pivot['lower_limit'] = pivot['mean'] - 2 * pivot['std']
pivot['upper_limit'] = pivot['mean'] + 2 * pivot['std']
First, merge pivoted['lower_limit'] back into temp, so that for each price in temp there is also a lower_limit value:
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
Then you can restrict your attention to those rows in temp for which the price is >= lower_limit:
temp.loc[temp['price'] >= temp['lower_limit']]
The desired result can be found by computing a groupby/min:
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
For example,
import numpy as np
import pandas as pd

np.random.seed(2017)
N = 1000
ABC = list('ABC')
temp = pd.DataFrame(np.random.randint(2, size=(N, 3)), columns=ABC)
temp['price'] = np.random.random(N)

pivoted = pd.pivot_table(temp, index=['A', 'B', 'C'], values=['price'],
                         aggfunc=[np.mean, np.std], fill_value=0)
pivoted['lower_limit'] = pivoted['mean'] - 2 * pivoted['std']
pivoted['upper_limit'] = pivoted['mean'] + 2 * pivoted['std']

temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
print(result)
yields
A  B  C
0  0  0    0.003628
      1    0.000132
   1  0    0.005833
      1    0.000159
1  0  0    0.006203
      1    0.000536
   1  0    0.001745
      1    0.025713
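As a side note, a groupby/transform sketch (my own variant, not part of the answer above) can avoid the pivot-and-merge round trip entirely; starting from the original temp, before the merge:

g = temp.groupby(ABC)['price']
lower_limit = g.transform('mean') - 2 * g.transform('std')  # per-row group limit, no merge needed
result = temp.loc[temp['price'] >= lower_limit].groupby(ABC)['price'].min()
print(result)

transform('std') uses ddof=1, which is also how pandas has historically evaluated an np.std aggfunc in pivot_table, so the limits should agree.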

Issues with using np.linalg.solve in Python

Below, I'm trying to code a Crank-Nicolson numerical solution to the Navier-Stokes equation for momentum (simplified, with placeholders for the time being), but I'm having issues with solving for umat[timecount,:] and keep getting the error "ValueError: setting an array element with a sequence". I'm extremely new to Python; does anyone know what I could do differently to avoid this problem?
Thanks!!
def step(timesteps, dn, dt, Numvpts, Cd, g, alpha, Sl, gamma, theta_L, umat):
    for timecount in range(0, timesteps + 1):
        if timecount == 0:
            umat[timecount, :] = 0
        else:
            Km = 1           # placeholder for eddy viscosity
            thetaM = 278.15  # placeholder for theta_m for the time being
            A = Km * dt / (2 * (dn ** 2))
            B = (-g * dt / theta_L) * thetaM * np.sin(alpha)
            C = -dt * (1 / (2 * Sl) + Cd)
            W_arr = np.zeros(Numvpts + 1)
            D = np.zeros(Numvpts + 1)
            for x in (0, Numvpts):  # creating the vertical velocity term
                if x == 0:
                    W_arr[x] = 0
                    D[x] = 0
                else:
                    W_arr[x] = W_arr[x - 1] - (dn / Sl) * umat[timecount - 1, x - 1]
            D = W_arr / (4 * dn)
            coef_mat_u = Neumann_mat(Numvpts, D - A, (1 + 2 * A), -(A + D))
            b_arr_u = np.zeros(Numvpts + 1)  # the array of known quantities
            umat_forward = umat[timecount - 1, 2:Numvpts]
            umat_center = umat[timecount - 1, 1:Numvpts - 1]
            umat_backward = umat[timecount - 1, 0:Numvpts - 2]
            b_arr_u = np.zeros(Numvpts + 1)
            for j in (0, Numvpts):
                if j == 0:
                    b_arr_u[j] = 0
                elif j == Numvpts:
                    b_arr_u[j] = 0
                else:
                    b_arr_u[j] = (A + D[j]) * umat_backward[j] * (1 - 2 * A) * umat_center[j] + (A - D[j]) * umat_forward[j] - C * (umat_center[j] * umat_center[j]) - B
            umat[timecount, :] = np.linalg.solve(coef_mat_u, b_arr_u)
    return umat
Please note that

for i in (0, 20):
    print(i)

will give the result 0 20, not 0 1 2 3 4 ... 20. So you have to use the range() function:

for i in range(0, 20 + 1):
    print(i)

to get 0 1 2 3 4 ... 20.
I have not gone through your code rigorously, but I think the problem is in your two inner for loops:

for x in (0, Numvpts):  # creating the vertical velocity term

which sets values only at index 0 and index Numvpts. I think you must use

for x in range(0, Numvpts):

The case is similar here (range() must be used):

for j in (0, Numvpts):

Also, with range(0, Numvpts), j never becomes == Numvpts, yet you are checking that condition? I guess it must be == Numvpts - 1. And the else condition is then reached for every index other than 0, so in your code the right-hand-side vector has the same numbers from index 1 onwards!
I think the fundamental problem is that you are not using range(). It is also a good idea to solve the Navier-Stokes equations for a small grid and manually check the A and b matrices to see whether they are being set correctly.
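Putting those fixes together, a sketch of the two inner loops inside the else branch of the question's step() function might look like this (the b_arr_u formula is kept exactly as written in the question, and umat is indexed directly instead of through the pre-sliced views, to keep the offsets obvious):

# vertical velocity term: index 0 stays zero, as in the original
for x in range(1, Numvpts + 1):
    W_arr[x] = W_arr[x - 1] - (dn / Sl) * umat[timecount - 1, x - 1]
D = W_arr / (4 * dn)

# known-quantities vector: the boundary entries stay zero
for j in range(1, Numvpts):
    u_b = umat[timecount - 1, j - 1]  # backward neighbour
    u_c = umat[timecount - 1, j]      # centre
    u_f = umat[timecount - 1, j + 1]  # forward neighbour
    # formula copied from the question; the '*' between the first two terms
    # looks like it might be a typo for '+', so double-check it
    b_arr_u[j] = (A + D[j]) * u_b * (1 - 2 * A) * u_c \
                 + (A - D[j]) * u_f - C * (u_c * u_c) - B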

Pandas/Numpy: Calculate current state series based on binary signals

I have two time series of binary "signals"; let's call them "entry" and "stay".
entry==1 means add 1 to the current state (for some maximum amount of time) and stay==0 means set the current state to 0.
entry: 0 1 1 0 1 0
stay:  1 1 1 1 0 1
My code now calculates a combined current state:
state: 0 1 2 2 0 1
Currently I use the following code; unfortunately it's (depending on the max time) quite slow (state/stay/entry are pandas time series):
state = copy.deepcopy(entry)
state[stay == 0] = 0
# first iteration
state[(entry.shift(1) == 1) & (stay == 1)] += 1
# 2nd iteration up to max time
for lag in range(2, max_time + 1):
    state[(entry.shift(lag) == 1) & (pd.rolling_mean(stay, lag) == 1)] += 1
Any idea how to vectorize this code for better performance? Many thanks!
I finally found a solution, using some NumPy functions:
def calc_state_series(entry, stay, max_time=5):
    reduce = (copy.deepcopy(entry) * 0).fillna(0)  # just for initialization
    reduce[(entry.shift(max_time) == 1) & (pd.rolling_mean(stay, max_time) == 1)] -= 1
    entry = (entry + stay.shift(1)).fillna(0)  # reduce state after max_time
    x = entry.values
    x = np.concatenate(([0], x))
    y = stay.values
    y = np.concatenate(([0], y))
    nans = y == 0
    x = np.array(x)
    x[nans] = 0
    reset_idx = np.zeros(len(x), dtype=int)
    reset_idx[nans] = np.arange(len(x))[nans]
    reset_idx = np.maximum.accumulate(reset_idx)
    cumsum = np.cumsum(x)
    cumsum = cumsum - cumsum[reset_idx]
    return pd.Series(cumsum[1:], index=entry.index)
I managed to avoid the loop, and this solution is (depending on max_time) up to 100x faster for me, but there is probably still potential for further optimization.
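One caveat if you run either version on a current pandas release: pd.rolling_mean was removed when the rolling-window API moved onto Series, so the rolling_mean calls need the method form instead, e.g.:

# old: pd.rolling_mean(stay, lag) == 1
# new:
state[(entry.shift(lag) == 1) & (stay.rolling(lag).mean() == 1)] += 1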
