I have a dataframe from which I want to select data within a range, but only the first occurrence of that range.
The dataframe:
import pandas as pd

data = {'x':[1,2,3,4,5,6,7,6.5,5.5,4.5,3.5,2.5,1], 'y':[1,4,3,3,52,3,74,64,15,41,31,12,11]}
df = pd.DataFrame(data)
e.g. select x from 2 to 6, first occurrence:
x y
0 1.0 1 #out of range
1 2.0 4 #out of range
2 3.0 3 #this first occurrence
3 4.0 3 #this first occurrence
4 5.0 52 #this first occurrence
5 6.0 3 #out of range
6 7.0 74 #out of range
7 6.5 64 #out of range
8 5.5 15 #not this since repeating RANGE
9 4.5 41 #not this since repeating RANGE
10 3.5 31 #not this since repeating RANGE
11 2.5 12 #not this since repeating RANGE
12 1.0 11 #out of range
Output
x y
2 3.0 3 #this first occurrence
3 4.0 3 #this first occurrence
4 5.0 52 #this first occurrence
I am trying to modify this example, Select DataFrame rows between two dates, to select data between two values for their first occurrence:
xlim=[2,6]
mask = (df['x'] > xlim[0]) & (df['x'] <= xlim[1])
df=df.loc[mask] #need to make it the first occurrence here
Here's one approach:
# True whenever a value is within the (exclusive) range
m = df.x.between(2, 6, inclusive=False)  # use inclusive='neither' on pandas >= 1.3
# XOR with the previous row flags each boundary of a run; cumsum then
# numbers the runs, and the first in-range run is the one labelled 1
df.loc[(m ^ m.shift()).cumsum().eq(1)]
x y
2 3.0 3
3 4.0 3
4 5.0 52
Details -
df.assign(in_range=m, run_number=(m ^ m.shift()).cumsum())
x y in_range run_number
0 1.0 1 False 0
1 2.0 4 False 0
2 3.0 3 True 1
3 4.0 3 True 1
4 5.0 52 True 1
5 6.0 3 False 2
6 7.0 74 False 2
7 6.5 64 False 2
8 5.5 15 True 3
9 4.5 41 True 3
10 3.5 31 True 3
11 2.5 12 True 3
12 1.0 11 False 4
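The run numbering also generalizes: in-range runs always receive odd labels (1, 3, 5, ...), so the n-th pass through the range is the run labelled 2*n - 1. A minimal sketch under the same setup (n is a hypothetical parameter; n=2 picks the second, descending pass, rows 8-11 above):
n = 2
m = df.x.between(2, 6, inclusive=False)  # inclusive='neither' on pandas >= 1.3
df.loc[(m ^ m.shift()).cumsum().eq(2 * n - 1)]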
Related
I have this df
d={}
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
I would like to create a column holding the next value of column qty that differs from the current one. Meaning: if qty is equal to 5 and its next row is also 5, I skip it and keep looking until I find the next value not equal to 5; in my case it is 6. And all of this should be grouped by id.
Here is the desired dataframe.
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
d['qty2']=[6,6,6,6,6,5,'NAN','NAN',2,2,3,3,3,5,8,'NAN']
Any help is very much appreciated
You can groupby.shift, mask the identical values, and groupby.bfill:
# shift up per group
s = df.groupby('id')['qty'].shift(-1)
# keep only the different values and bfill per group
df['qty2'] = s.where(df['qty'].ne(s)).groupby(df['id']).bfill()
output:
id qty qty2
0 1 5 6.0
1 1 5 6.0
2 1 5 6.0
3 1 5 6.0
4 1 5 6.0
5 1 6 5.0
6 1 5 NaN
7 1 5 NaN
8 2 1 2.0
9 2 1 2.0
10 2 2 3.0
11 2 2 3.0
12 2 2 3.0
13 2 3 5.0
14 2 5 8.0
15 2 8 NaN
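To see why the bfill lands on the right values, here is a sketch of the intermediate series under the same setup:
# next qty within each id
s = df.groupby('id')['qty'].shift(-1)
# NaN wherever the next value equals the current one
masked = s.where(df['qty'].ne(s))
print(masked.tolist())
# [nan, nan, nan, nan, 6.0, 5.0, nan, nan, nan, 2.0, nan, nan, 3.0, 5.0, 8.0, nan]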
I currently have a dataframe that contains a column called load, and I want to create a column called calculated load that applies a simple formula to the load column using a variable. However, I want the calculation to switch the variable it uses when it sees the value 1 in a column called position, and to keep using that formula until it sees -1 in position, when the values start to rise again. Here is my current code:
import pandas as pd
s_falling = -4
s_rising = 2
x = 2
df = pd.DataFrame({"load": [1,2,4,6,2,4,7,4,8,3,4,7,3,3,6,4,7,4,3,2],
"position": [0,0.2,0.5,0.8,0.7,1,0.7,0.6,0.7,0.8,0.4,0.2,0,-0.5,-0.8,-1,-0.8,-0.9,-0.7,-0.6]})
df['calculated load'] = df['load'] + x * s_rising
print(df['calculated load'])
0 5
1 6
2 8
3 10
4 6
5 8
6 11
7 8
8 12
9 7
10 8
11 11
12 7
13 7
14 10
15 8
16 11
17 8
18 7
19 6
This works up to the position after 1. When the values start falling, I want to use a formula that swaps s_rising for s_falling and keeps using this new variable while iterating over the column, then reverts back to the original formula with s_rising from the position after -1 is seen again:
df['calculated load'] = df['load'] + x * s_falling
The formula doesn't change, merely the variable being used within it.
I can't just check whether the next value is less than or greater than the previous one, as the values in position don't rise and fall perfectly. Ideally, this would be my desired output:
print(df['calculated load'])
0 5
1 6
2 8
3 10
4 6
5 8
6 3
7 0
8 4
9 -1
10 0
11 3
12 -1
13 -1
14 2
15 0
16 11
17 8
18 7
19 8
EDIT: Some very kind people have offered solutions, and I have realised that my question (designed to produce a small, reproducible example) was slightly off the mark. I have edited the question to reflect this.
Check this and let me know if it works.
x = 2
df = pd.DataFrame({"load": [1,2,4,6,2,4,7,4,8,3,4,7,3,3,6,4,7,4,3,2],
"position": [0,0.2,0.5,0.8,0.7,1,0.7,0.6,0.7,0.8,0.4,0.2,0,-0.5,-0.8,-1,-0.8,-0.9,-0.7,-0.6]})
for i, row in df.iterrows():
    if df[df['position']==1.0].index[0] <= i < df[df['position']==-1.0].index[0]:
        df.loc[i, 'calculated load'] = df.loc[i, 'load'] - x
    else:
        df.loc[i, 'calculated load'] = df.loc[i, 'load'] + x
print(df)
load position calculated load
0 1 0.0 3.0
1 2 0.2 4.0
2 4 0.5 6.0
3 6 0.8 8.0
4 2 0.7 4.0
5 4 1.0 2.0
6 7 0.7 5.0
7 4 0.6 2.0
8 8 0.7 6.0
9 3 0.8 1.0
10 4 0.4 2.0
11 7 0.2 5.0
12 3 0.0 1.0
13 3 -0.5 1.0
14 6 -0.8 4.0
15 4 -1.0 6.0
16 7 -0.8 9.0
17 4 -0.9 6.0
18 3 -0.7 5.0
19 2 -0.6 4.0
I believe this code works, but it's not efficient because of iterrows(). If someone finds a way to vectorize it, you can comment on my answer.
import pandas as pd
x = 2
df = pd.DataFrame({"load": [1,2,4,6,2,4,7,4,8,3,4,7,3,3,6,4,7,4,3,2],
"position": [0,0.2,0.5,0.8,0.7,1,0.7,0.6,0.7,0.8,0.4,0.2,0,-0.5,-0.8,-1,-0.8,-0.9,-0.7,-0.6]})
increasing = True
list_increasing = []
for index, row in df.iterrows():
    if increasing and row.position == 1:
        increasing = False
    elif not increasing and row.position == -1:
        increasing = True
    list_increasing.append(increasing)
df['increasing'] = list_increasing
def calculated_load(row):
    if row.increasing:
        return row.load + x
    else:
        return row.load - x
df['cal load'] = df.apply(calculated_load, axis=1)
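The final apply can also be swapped for np.where once the increasing column exists; a small sketch, assuming the df and x above:
import numpy as np

# vectorized version of calculated_load: +x while rising, -x while falling
df['cal load'] = np.where(df['increasing'], df['load'] + x, df['load'] - x)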
Without a loop, you can use:
import numpy as np

x1 = df['position'].eq(1).mul(-x).shift(fill_value=0)
x2 = df['position'].eq(-1).mul(x)
xm = (x1 | x2).replace(0, np.nan).ffill().fillna(x).astype(int)
df['calculated load'] = df['load'] + xm
Output:
>>> df
load position calculated load
0 1 0.0 3
1 2 0.2 4
2 4 0.5 6
3 6 0.8 8
4 2 0.7 4
5 4 1.0 6
6 7 0.7 5
7 4 0.6 2
8 8 0.7 6
9 3 0.8 1
10 4 0.4 2
11 7 0.2 5
12 3 0.0 1
13 3 -0.5 1
14 6 -0.8 4
15 4 -1.0 6
16 7 -0.8 9
17 4 -0.9 6
18 3 -0.7 5
19 2 -0.6 4
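For reference, the intermediate series behave like a state machine: x1 injects -x on the row after position hits 1, x2 injects +x on the row where position hits -1, and the ffill carries the last switch forward. A sketch of their values for the df above:
print((x1 | x2).tolist())
# [0, 0, 0, 0, 0, 0, -2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0]
print(xm.tolist())
# [2, 2, 2, 2, 2, 2, -2, -2, -2, -2, -2, -2, -2, -2, -2, 2, 2, 2, 2, 2]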
Given a dataframe df, I would like to generate a new variable/column for each row based on the values in the previous n rows (for example previous 3).
For example, given the following:
INPUT
A B C
10 2 59.4
53 3 71.5
32 2 70.4
24 3 82.1
Calculation for D: if, among the current row's value of C and the previous 3 rows of C, there are 2 or more values > 70, then 1, else 0.
OUTPUT
A B C D
10 2 59.4 0
53 3 71.5 0
32 2 70.4 1
24 3 82.1 1
How should I do it in pandas?
IIUC, you should use rolling and build your logic in the apply:
window = 3
df.C.rolling(window).apply(lambda s: 1 if (s>=70).size >= 2 else 0)
0 NaN
1 NaN
2 1.0
3 1.0
You can also use fillna to turn the NaNs into 0:
.fillna(0)
0 0.0
1 0.0
2 1.0
3 1.0
I think @RafaelC's answer is the right approach. I'm adding an answer to (a) provide better example data that covers edge cases and (b) adjust @RafaelC's syntax slightly. In particular:
- min_periods=1 allows early rows, whose index values are smaller than the window, to be non-NaN
- window = 4 allows the current entry plus the previous 3 to be considered
- use sum() instead of size to count only the True values
Updated code:
window = 4
df.C.rolling(window, min_periods=1).apply(lambda x: (x>70).sum()>=2)
Data:
A B C
10 2 59.4
53 3 71.5
32 2 70.4
24 3 82.1
11 4 10.1
10 5 1.0
12 3 2.3
13 2 1.1
99 9 70.2
12 9 80.0
Expected output according to OP rules:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 0.0
8 0.0
9 1.0
Name: C, dtype: float64
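Note that rolling().apply returns floats even when the lambda returns a boolean; if you want the 0/1 integers from the question, cast at the end. A sketch assuming the same df and window (safe here because min_periods=1 leaves no NaNs):
df['D'] = (df.C.rolling(window, min_periods=1)
             .apply(lambda x: (x > 70).sum() >= 2)
             .astype(int))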
I would like to filter and replace: for the values that are lower or higher than zero and not NaN, I would like to set them to one, and set all the others to zero.
mask = ((ts[x] > 0)
| (ts[x] < 0))
ts[mask]=1
ts[ts[x]==1]
I did this and it works, but I still have to handle the values that do not meet this condition by replacing them with zero.
Any recommendations? I am quite confused; also, would it be better to use the where function in this case?
Thanks all!
Sample Data
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
Expected result
asset.relativeSetpoint.350
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
You can do this by applying a logical AND on the two conditions and converting the resultant mask to integer.
df
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
(df['asset.relativeSetpoint.350'].ne(0)
& df['asset.relativeSetpoint.350'].notnull()).astype(int)
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
Name: asset.relativeSetpoint.350, dtype: int64
The first condition df['asset.relativeSetpoint.350'].ne(0) gets a boolean mask of all elements that are not equal to 0 (this would include <0, >0, and NaN).
The second condition df['asset.relativeSetpoint.350'].notnull() will get a boolean mask of elements that are not NaNs.
The two masks are ANDed, and converted to integer.
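An equivalent one-liner with np.where; a sketch where the output column name is made up (the fillna matters because a plain NaN != 0 comparison evaluates to True):
import numpy as np

col = 'asset.relativeSetpoint.350'
df[col + '_flag'] = np.where(df[col].fillna(0).ne(0), 1, 0)  # '_flag' is a hypothetical name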
How about using apply? Note that NaN != 0 evaluates to True, so the NaNs have to be excluded explicitly:
df[COLUMN_NAME] = df[COLUMN_NAME].apply(lambda x: 1 if pd.notna(x) and x != 0 else 0)
In Pandas, I'm trying to figure out how to generate a column that is the difference between the time of the current row and the time of the last row in which the value of another column is True:
So given the dataframe:
import pandas as pd

df = pd.DataFrame({'Time':[5,10,15,20,25,30,35,40,45,50],
                   'Event_Occured': [True,False,False,True,True,False,False,True,False,False]})
print(df)
Event_Occured Time
0 True 5
1 False 10
2 False 15
3 True 20
4 True 25
5 False 30
6 False 35
7 True 40
8 False 45
9 False 50
I'm trying to generate a column that would look like this:
Event_Occured Time Time_since_last
0 True 5 0
1 False 10 5
2 False 15 10
3 True 20 0
4 True 25 0
5 False 30 5
6 False 35 10
7 True 40 0
8 False 45 5
9 False 50 10
Thanks very much!
Using df.Event_Occured.cumsum() gives you distinct groups to groupby. Then applying a function per group that subtracts the first member's value from every member gets you what you want.
df['Time_since_last'] = \
df.groupby(df.Event_Occured.cumsum()).Time.apply(lambda x: x - x.iloc[0])
df
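The same groupby can use transform('first') instead of apply, which skips building a sub-series per group; a sketch under the same setup:
groups = df.Event_Occured.cumsum()
df['Time_since_last'] = df.Time - df.groupby(groups).Time.transform('first')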
Here's an alternative that fills the values corresponding to Falses with the last valid observation:
df['Time'] - df.loc[df['Event_Occured'], 'Time'].reindex(df.index).ffill()
Out:
0 0.0
1 5.0
2 10.0
3 0.0
4 0.0
5 5.0
6 10.0
7 0.0
8 5.0
9 10.0
Name: Time, dtype: float64
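To attach this as an integer column (a sketch; the cast is safe here because the first row is an event, so the ffill leaves no NaN):
last_event = df.loc[df['Event_Occured'], 'Time'].reindex(df.index).ffill()
df['Time_since_last'] = (df['Time'] - last_event).astype(int)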