Assume I have the following dataframe:
Time Flag1
0 0
10 0
30 0
50 1
70 1
90 1
110 1
My goal is: for each row, look at the window of rows whose Time is less than that row's Time plus 35; if any Flag1 in that window is 1, then that row's output should be 1. For example, consider the above example:
The first element of Time is 0, so 0 + 35 = 35. In the window of values less than 35 (Time = 0, 10, 30) all the Flag1 values are 0, so the first row is assigned 0, and so on. The next window is 10 + 35 = 45, which still includes (0, 10, 30), and the flags are still 0. So the complete output is:
Time Flag1 Output
0 0 0
10 0 0
30 0 1
50 1 1
70 1 1
90 1 1
110 1 1
To implement this, I thought I could use two for loops like this:
output = []
for ii in range(Data.shape[0]):
    count = 0
    th = Data.loc[ii, 'Time'] + 35
    for jj in range(ii, Data.shape[0]):
        if Data.loc[jj, 'Time'] < th and Data.loc[jj, 'Flag1'] == 1:
            count = 1
            break
    output.append(count)
However, this looks tedious, since the inner loop has to run over the remaining length of the data for every row. I am also not sure whether this method handles the boundary cases (out-of-bounds indices) near the end of the dataframe. I would appreciate suggestions for something simpler. This is like a sliding-window operation, only comparing a number to a threshold.
Edit: I do not want to compare only two consecutive rows. For example, if 30 + 35 = 65, then as long as Time is less than 65 and Flag1 is 1, the output should be 1.
The second example:
Time Flag1 Output
0 0 0
30 0 1
40 0 1
60 1 1
90 1 1
140 1 1
200 1 1
350 1 1
Assuming a window k rows before and k rows after as mentioned in my comment:
import pandas as pd

Data = pd.DataFrame([[0,0], [10,0], [30,0], [50,1], [70,1], [90,1], [110,1]],
                    columns=['Time', 'Flag1'])

k = 1  # size of window: up to k rows before and up to k rows after
n = len(Data)
output = [0]*n
for i in range(n):
    th = Data['Time'][i] + 35
    j0 = max(0, i - k)
    j1 = min(i + k + 1, n)  # the +1 is because range is non-inclusive of end
    output[i] = int(any((Data['Time'][j0:j1] < th) & (Data['Flag1'][j0:j1] > 0)))
Data['output'] = output
print(Data)
gives the same output as the original example. And you can change the size of the window by modifying k.
Of course, if the idea is to check any row afterward, then just use j1 = n in my example.
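For long dataframes, that forward-looking variant can also be vectorized. The following is only a minimal sketch, assuming Time is sorted in ascending order and the window runs from the current row forward; it uses np.searchsorted to find where each window ends and a cumulative sum of Flag1 to test whether any flag is set inside the window:
import numpy as np
import pandas as pd

Data = pd.DataFrame([[0,0], [10,0], [30,0], [50,1], [70,1], [90,1], [110,1]],
                    columns=['Time', 'Flag1'])

times = Data['Time'].to_numpy()
flags = Data['Flag1'].to_numpy()

# csum[i] = number of set flags in rows 0..i-1
csum = np.concatenate(([0], np.cumsum(flags)))
# for each row, index of the first row whose Time is >= Time[i] + 35
end = np.searchsorted(times, times + 35, side='left')
# any Flag1 == 1 between row i and the end of its window?
Data['output'] = (csum[end] - csum[np.arange(len(Data))] > 0).astype(int)
print(Data)
This avoids any Python-level loop, at the cost of assuming the Time column is sorted.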
import pandas as pd

Data = pd.DataFrame([[0,0],[10,0],[30,0],[50,1],[70,1],[90,1],[110,1]], columns=['Time','Flag1'])

# for each row x, look at all later rows whose Time is below Time[x] + 35
# and check whether any of them has Flag1 == 1
output = Data.index.map(lambda x: 1 if any((Data.Time[x+1:] < Data.Time[x]+35) * (Data.Flag1[x+1:] == 1)) else 0).values
# the last row has no rows after it, so fall back to its own flag
output[-1] = Data.Flag1.values[-1]
Data['output'] = output
print(Data)
# show
Time Flag1 output
0 0 0
10 0 0
30 0 1
50 1 1
70 1 1
90 1 1
110 1 1
I have a large dataframe with a price column that stays at the same value as time increases, then changes value and stays at that new value for a while before going up or down. I want to write a function that looks at the price column and creates a new column called next movement that indicates whether the next movement of the price column will be up or down.
For example if the price column looked like [1,1,1,2,2,2,4,4,4,3,3,3,4,4,4,2,1] then the next movement column should be [1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,-1] with 1 representing the next movement being up 0 representing the next movement being down, and -1 representing unknown.
def make_next_movement_column(DataFrame, column):
    DataFrame["next movement"] = -1
    for i in range(DataFrame.shape[0]):
        for j in range(i + 1, DataFrame.shape[0]):
            if DataFrame[column][j] > DataFrame[column][i]:
                DataFrame["next movement"][i:j] = 1
                break
            if DataFrame[column][j] < DataFrame[column][i]:
                DataFrame["next movement"][i:j] = 0
                break
        i = j - 1
    return DataFrame
I wrote this function and it does work, but the problem is that it is horribly inefficient. I was wondering if there is a more efficient way to write this function.
This answer doesn't seem to work, because the diff method only looks at the next row, but I want to find the next movement no matter how far away it is.
Annotated code
import numpy as np

# Calculate the diff between each row and the next one
s = df['column'].diff(-1)
# Broadcast the last diff value per group of equal prices
s = s.mask(s == 0).bfill()
# Select from [1, 0] depending on the sign of the diff
df['next_movement'] = np.select([s <= -1, s >= 1], [1, 0], -1)
Result
column next_movement
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 4 0
7 4 0
8 4 0
9 3 1
10 3 1
11 3 1
12 4 0
13 4 0
14 4 0
15 2 0
16 1 -1
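Putting the pieces together, here is a minimal sketch of the same idea as a reusable function; the function name is my own choice, and it keys off the sign of the change rather than the fixed ±1 thresholds above, which should behave the same for integer price steps:
import numpy as np
import pandas as pd

def next_movement(prices):
    """1 if the next different price is higher, 0 if lower, -1 if unknown."""
    s = prices.diff(-1)          # current row minus next row
    s = s.mask(s == 0).bfill()   # carry the next non-zero change backwards
    return pd.Series(np.select([s < 0, s > 0], [1, 0], -1), index=prices.index)

df = pd.DataFrame({'column': [1, 1, 1, 2, 2, 2, 4, 4, 4, 3, 3, 3, 4, 4, 4, 2, 1]})
df['next_movement'] = next_movement(df['column'])
print(df)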
Consider this pandas dataframe where the condition column is 1 when value is below 5 (any threshold).
import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df
Out[1]:
value condition
0 30 0
1 100 0
2 4 1
3 0 1
4 80 0
5 0 1
6 1 1
7 4 1
8 70 0
9 70 0
What I want is for all consecutive values below 5 to share the same id, and for all values above five to have 0 (or NA, or a negative value; it doesn't matter, they just need to be the same). I want to create a new column called new_id that contains these cumulative ids, as follows:
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
In a very inefficient for loop I would do this (which works):
counter = 1  # id to assign to the next cluster
for i in range(0, df.shape[0]):
    if (df.loc[df.index[i], 'condition'] == 1) & (df.loc[df.index[i-1], 'condition'] == 0):
        new_id = counter  # assign new id
        counter += 1
    elif (df.loc[df.index[i], 'condition'] == 1) & (df.loc[df.index[i-1], 'condition'] != 0):
        new_id = counter - 1  # assign current id
    elif df.loc[df.index[i], 'condition'] == 0:
        new_id = df.loc[df.index[i], 'condition']  # assign 0
    df.loc[df.index[i], 'new_id'] = new_id
df
But this is very inefficient and I have a very big dataset. I therefore tried different kinds of vectorization, but so far I have failed to keep it from counting up inside each "cluster" of consecutive points:
# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]
# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]
I also tried using apply() with a custom if else function but it seems like this does not allow me to use a counter.
There is already a ton of similar posts about this but none of them keep the same id for consecutive rows.
Example posts are:
Maintain count in python list comprehension
Pandas cumsum on a separate column condition
Python - keeping counter inside list comprehension
python pandas conditional cumulative sum
Conditional count of cumulative sum Dataframe - Loop through columns
You can use the cumsum(), as you did in your first try, just modify it a bit:
# calculate delta between the current and the previous condition value
df['delta'] = df['condition'] - df['condition'].shift(1).fillna(0)
# get rid of -1 for the cumsum (replace it by 0)
df['delta'] = df['delta'].replace(-1, 0)
# cumulative sum, made conditional by multiplying with the condition column
df['cumsum_x'] = df['delta'].cumsum() * df['condition']
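The same idea also fits in a single expression. This is just a sketch equivalent to the delta/cumsum steps above (shift(fill_value=0) makes it work even if the series starts with a 1):
# each 0 -> 1 transition opens a new id; multiplying by condition keeps rows above the threshold at 0
df['new_id'] = (df['condition'] > df['condition'].shift(fill_value=0)).cumsum() * df['condition']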
Welcome to SO! Why not just rely on base Python for this?
def counter_func(l):
    new_id = [0]  # first value is zero in any case
    counter = 0
    for i in range(1, len(l)):
        if l[i] == 0:
            new_id.append(0)
        elif l[i] == 1 and l[i-1] == 0:
            counter += 1
            new_id.append(counter)
        elif l[i] == l[i-1] == 1:
            new_id.append(counter)
        else:
            new_id.append(None)
    return new_id
df["new_id"] = counter_func(df["condition"])
Looks like this
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
Edit:
You can also use numba, which sped the function up quite a lot for me: from about 1 s to ~60 ms.
You should pass NumPy arrays to the function, meaning you'll have to use df["condition"].values.
from numba import njit
import numpy as np

@njit
def func(arr):
    res = np.empty(arr.shape[0])
    counter = 0
    res[0] = 0  # first value is zero anyway
    for i in range(1, arr.shape[0]):
        if arr[i] == 0:
            res[i] = 0
        elif arr[i] and arr[i-1] == 0:
            counter += 1
            res[i] = counter
        elif arr[i] == arr[i-1] == 1:
            res[i] = counter
        else:
            res[i] = np.nan
    return res
df["new_id"] = func(df["condition"].values)
The following function counts the number of points within different segments of a circle. It works as intended when exporting counts for a single point in time. However, when trying to export these counts at different points in time using a groupby call, it still combines all counts into a single output.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Time' : ['19:50:10.1','19:50:10.1','19:50:10.1','19:50:10.1','19:50:10.2','19:50:10.2','19:50:10.2','19:50:10.2'],
'id' : ['A','B','C','D','A','B','C','D'],
'x' : [1,8,0,-5,1,-1,-6,0],
'y' : [-5,2,-5,2,5,-5,-2,2],
'X2' : [0,0,0,0,0,0,0,0],
'Y2' : [0,0,0,0,0,0,0,0],
'Angle' : [0,0,0,0,0,0,0,0],
})
def checkPoint(x, y, rotation_angle, refX, refY, radius=10):
    section_angle_start = [(i + rotation_angle - 45) for i in [0, 90, 180, 270, 360]]
    Angle = np.arctan2(x - refX, y - refY) * 180 / np.pi
    Angle = Angle % 360
    # adjust range
    if Angle > section_angle_start[-1]:
        Angle -= 360
    elif Angle < section_angle_start[0]:
        Angle += 360
    for i in range(4):
        if section_angle_start[i] < Angle < section_angle_start[i+1]:
            break
    else:
        i = 0
    return i + 1
tmp = []
result = []
The following is my attempt to pass the checkPoint function to each unique group in Time.
for group in df.groupby('Time'):
    for i, row in df.iterrows():
        seg = checkPoint(row.x, row.y, row.Angle, row.X2, row.Y2)
        tmp.append(seg)
    result.append([tmp.count(i) for i in [1, 2, 3, 4]])

df = pd.DataFrame(result, columns=['1', '2', '3', '4'])
Out:
1 2 3 4
0 2 1 3 2
1 4 2 6 4
Intended Out:
1 2 3 4
0 0 1 2 1
1 2 0 1 1
Your inner loop is running through your entire DataFrame, and generating the double-counting you are observing.
As @Kenan suggested, you can limit the inner loop to the group, resetting tmp for each one:
result = []
for group in df.groupby('Time'):
    tmp = []  # reset the per-group counts
    for i, row in group[1].iterrows():
        seg = checkPoint(row.x, row.y, row.Angle, row.X2, row.Y2)
        tmp.append(seg)
    result.append([tmp.count(i) for i in [1, 2, 3, 4]])

df_result = pd.DataFrame(result, columns=['1', '2', '3', '4'])
print(df_result)
Resulting in
1 2 3 4
0 0 1 2 1
1 2 0 1 1
Or you can use a groupby-apply construct to avoid the explicit loop:
def result(g):
    tmp = []
    for i, row in g.iterrows():
        seg = checkPoint(row.x, row.y, row.Angle, row.X2, row.Y2)
        tmp.append(seg)
    return pd.Series([tmp.count(i) for i in [1, 2, 3, 4]], index=[1, 2, 3, 4])

print(df.groupby('Time').apply(result))
Which gets you:
1 2 3 4
Time
19:50:10.1 0 1 2 1
19:50:10.2 2 0 1 1
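If checkPoint itself becomes a bottleneck, the segment can also be computed for every row at once and counted with pd.crosstab. This is only a sketch: it assumes the rotation Angle is zero for every row (as in the sample data), and points that land exactly on a 45-degree boundary may be binned slightly differently than by checkPoint:
import numpy as np
import pandas as pd

# angle of each point relative to its reference, measured like checkPoint and folded into [0, 360)
angle = np.degrees(np.arctan2(df['x'] - df['X2'], df['y'] - df['Y2'])) % 360
# shift by 45 degrees so every 90-degree bin maps to one segment 1..4
segment = ((angle + 45) % 360 // 90).astype(int) + 1

counts = pd.crosstab(df['Time'], segment).reindex(columns=[1, 2, 3, 4], fill_value=0)
print(counts)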
I have a pandas dataframe and I want to loop over the last column "n" times based on a condition.
import random
import pandas as pd

p = 0.5
df = pd.DataFrame()
start = []
for i in range(5):
    if random.random() < p:
        start.append("0")
    else:
        start.append("1")
df['start'] = start
print(df['start'])
Essentially, I want to loop over the final column "n" times and if the value is 0, change it to 1 with probability p so the results become the new final column. (I am simulating on-off every time unit with probability p).
e.g. after one iteration, the dataframe would look something like:
0 0
0 1
1 1
0 0
0 1
after two:
0 0 1
0 1 1
1 1 1
0 0 0
0 1 1
What is the best way to do this?
Sorry if I am asking this wrong, I have been trying to google for a solution for hours and coming up empty.
Like this. Append columns named 1, 2, ...
# continue from question code ...
# column names are 1, 2, ...
for col in range(1, 5):
    tmp = []
    for i in range(5):
        # check the final column
        if df.iloc[i, col-1:col][0] == "0":
            if random.random() < p:
                tmp.append("0")
            else:
                tmp.append("1")
        else:  # == "1"
            tmp.append("1")
    # append new col
    df[str(col)] = tmp
print(df)
# initial
s
0 0
1 1
2 0
3 0
4 0
# result
s 1 2 3 4
0 0 0 1 1 1
1 0 0 0 0 1
2 0 0 1 1 1
3 1 1 1 1 1
4 0 0 0 0 0
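If the per-row loop gets slow for larger frames, the same simulation can be vectorized over rows. Below is a sketch under a few assumptions of my own: the column names '1', '2', ... and the sizes are just illustrative, a 0 switches on with probability p (as worded in the question), the coin flips are drawn with NumPy, and a running state keeps every value that has already switched on:
import numpy as np
import pandas as pd

rng = np.random.default_rng()
p, n_rows, n_steps = 0.5, 5, 4

df = pd.DataFrame({'start': (rng.random(n_rows) < p).astype(int).astype(str)})

state = df['start'].astype(int).to_numpy()
for col in range(1, n_steps + 1):
    # a 0 switches on with probability p; a 1 stays on
    state = np.maximum(state, (rng.random(n_rows) < p).astype(int))
    df[str(col)] = state.astype(str)
print(df)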
I am currently trying to create a function for a dataframe, and it is too complex for me. I have a dataframe that looks like this:
df1
hour production ....
0 1 10
0 2 20
0 1 30
0 3 40
0 1 40
0 4 30
0 1 20
0 4 10
I am trying to create a function that would do the following:
Group data by different hour
Calculate 90% confidence interval of production for each hour
If the production value of a particular row falls outside the 90% confidence interval for its respective hour, mark it as unusual in a new column
Below are the current steps I am taking to do the above for each individual hour:
Calculate confidence interval
from numpy import mean
from scipy.stats import sem, t

confidence = 0.90
data = df1['production']
n = len(data)
m = mean(data)
std_err = sem(data)
h = std_err * t.ppf((1 + confidence) / 2, n - 1)
lower_interval = m - h
upper_interval = m + h
Then:
def confidence_interval(x):
    if x['production'] > upper_interval:
        return 1
    if x['production'] < lower_interval:
        return 1
    return 0

df1['unusual'] = df1.apply(lambda x: confidence_interval(x), axis=1)
I am doing this for each of the values in hour, then having to merge all the results back into the original dataframe.
Can anyone help me create a function that does all of the above at once? I had a go but just can't get my head around it.
Create a custom function and use GroupBy.transform with Series.between, inverting the mask with ~:
from scipy.stats import sem, t
from numpy import mean

def confidence_interval(data):
    confidence = 0.90
    n = len(data)
    m = mean(data)
    std_err = sem(data)
    h = std_err * t.ppf((1 + confidence) / 2, n - 1)
    lower_interval = m - h
    upper_interval = m + h
    #print(lower_interval, upper_interval)
    return ~data.between(lower_interval, upper_interval, inclusive='neither')

df1['new'] = df1.groupby('hour')['production'].transform(confidence_interval).astype(int)
print(df1)
hour production new
0 1 10 0
0 2 20 1
0 1 30 0
0 3 40 1
0 1 40 0
0 4 30 0
0 1 20 0
0 4 10 0
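If you also want to keep the interval bounds per hour for inspection, here is a sketch along the same lines; the lower/upper column names are my own, groups with a single observation get NaN bounds and are therefore flagged, and the endpoints are treated as inclusive here, unlike the strict between above:
from scipy.stats import sem, t

confidence = 0.90
g = df1.groupby('hour')['production']
m = g.transform('mean')
h = g.transform(lambda x: sem(x) * t.ppf((1 + confidence) / 2, len(x) - 1))

df1['lower'], df1['upper'] = m - h, m + h
df1['unusual'] = (~df1['production'].between(df1['lower'], df1['upper'])).astype(int)
print(df1)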