I would like to select a specified number of rows after a condition is met:
Here is my dataframe:
I would like to select three rows after the entry is equal to 1, so for the first occurrence I would obtain something like this:
I don't know what the most appropriate output is if I want to study every occurrence; maybe a groupby?
First, remove the 0 rows before the first 1:
df = df[df['entry'].eq(1).cumsum().ne(0)]
df = df.groupby(df['entry'].cumsum()).head(4)
Timestamp entry
1 11.2 1
2 11.3 0
3 11.4 0
4 11.5 0
7 11.8 1
8 11.9 0
9 12.0 0
10 12.1 0
Details & explanation:
For a general solution that removes all values before the first match, compare with Series.eq, take a cumulative sum with Series.cumsum, and compare with Series.ne, so that all rows where the cumulative sum is still 0 are filtered out:
print (df.assign(comp1 = df['entry'].eq(1),
                 cumsum = df['entry'].eq(1).cumsum(),
                 mask = df['entry'].eq(1).cumsum().ne(0)))
Timestamp entry comp1 cumsum mask
0 11.1 0 False 0 False
1 11.2 1 True 1 True
2 11.3 0 False 1 True
3 11.4 0 False 1 True
4 11.5 0 False 1 True
5 11.6 0 False 1 True
6 11.7 0 False 1 True
7 11.8 1 True 2 True
8 11.9 0 False 2 True
9 12.0 0 False 2 True
10 12.1 0 False 2 True
After filtering by boolean indexing, create a helper Series with the cumulative sum to define the groups:
print (df['entry'].cumsum())
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
Name: entry, dtype: int64
So for the final solution, use GroupBy.head with 4 to get each row containing 1 together with the next 3 rows:
df = df.groupby(df['entry'].cumsum()).head(4)
print (df)
Timestamp entry
1 11.2 1
2 11.3 0
3 11.4 0
4 11.5 0
7 11.8 1
8 11.9 0
9 12.0 0
10 12.1 0
To loop over the groups, use:
for i, g in df.groupby(df['entry'].cumsum()):
    print (g.head(4))
If you want a list of DataFrames as output:
L = [g.head(4) for i, g in df.groupby(df['entry'].cumsum())]
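Putting it together, a minimal end-to-end sketch; the sample data is reconstructed from the detail output above:

import pandas as pd

# sample data reconstructed from the question's detail output
df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6,
                                 11.7, 11.8, 11.9, 12.0, 12.1],
                   'entry': [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]})

df = df[df['entry'].eq(1).cumsum().ne(0)]       # drop rows before the first 1
df = df.groupby(df['entry'].cumsum()).head(4)   # each 1 plus the next 3 rows
print(df)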
Let's say I have the following pandas DataFrame:
data
series time_idx value
0 0 0 -0.000000
1 0 1 0.018844
2 0 2 0.028694
3 0 3 0.050784
4 0 4 0.067037
... ... ... ...
3995 9 395 0.973978
3996 9 396 0.944002
3997 9 397 1.001089
3998 9 398 1.132001
3999 9 399 1.169244
4000 rows × 3 columns
I want to test whether, for each series (0..9), the time indexes increment by 1 from row to row, and if not, where the difference is.
I thought about sorting the dataframe by series and by time_idx and then comparing to the index mod 400, but it's not a nice solution. Any suggestions?
Thanks
The following is based on what I understand from your question; see if it answers it. I have to use the string 'True' instead of boolean True because the dataframe would otherwise convert it to the numeric 1.0.
import numpy as np

# True where the row continues the same series as the previous row
df['IncOne'] = (df.series == df.series.shift())
# within a series: 'True' if time_idx increased by 1, otherwise the actual gap;
# '' marks the first row of each series
df['IncOne'] = np.where(df.IncOne,
                        np.where(df.time_idx.eq(df.time_idx.shift() + 1),
                                 'True', df.time_idx - df.time_idx.shift()),
                        '')
    series  time_idx     value IncOne
0        0         0         0
1        0         1  0.018844   True
2        0         2  0.028694   True
3        0         3  0.050784   True
4        0         4  0.067037   True
5        0         6         0    2.0
6        0         7  0.018844   True
7        0         8  0.028694   True
8        0         9  0.050784   True
9        0        12  0.067037    3.0
10       0        13         1   True
11       9       395  0.973978
12       9       396  0.944002   True
13       9       397   1.00109   True
14       9       398     1.132   True
15       9       399   1.16924   True
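To pull out just the rows where the increment broke, a small follow-up sketch; it assumes IncOne ends up holding strings as in the output above, so anything that is neither 'True' nor '' is a gap:

print(df[~df['IncOne'].isin(['True', ''])])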
Assuming that the dataframe is df, you can try this:
df["diff"] = df.groupby(by="series")["time_idx"].diff().fillna(1) != 1
It will create a new boolean column "diff". A True value indicates that the time_idx difference between the current row and the preceding row is not one. Only differences between rows belonging to the same series can give a True value, because diff is taken per group and the first row of each group is filled with 1.
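To answer the "where is the difference" part, a quick follow-up sketch using the diff column created above:

# rows where time_idx did not increase by exactly 1 within its series
print(df[df['diff']][['series', 'time_idx']])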
I have data like this.
I calculate the mean for each ID:
df.groupby(['ID'], as_index=False)['A'].mean()
Now I want to drop all those IDs whose mean value is more than 3:
df.drop(df[df.A > 3].index)
And this is where I am stuck. I want to save the file in the original format (without grouping and without the mean values), but without those IDs whose means were more than 3.
Any idea how I can achieve this? The output should be something like this. I also want to know how many unique IDs were removed by the drop.
Use transform to get a Series with the same size as the original DataFrame, which makes it possible to filter by boolean indexing with the condition inverted from > 3 to <= 3:
df1 = df[df.groupby('ID')['A'].transform('mean') <= 3]
print (df1)
ID A
0 1 2
1 1 3
2 1 1
6 3 6
7 3 1
8 3 1
9 3 1
Details:
print (df.groupby('ID')['A'].transform('mean'))
0 2.000000
1 2.000000
2 2.000000
3 6.666667
4 6.666667
5 6.666667
6 2.250000
7 2.250000
8 2.250000
9 2.250000
Name: A, dtype: float64
print (df.groupby('ID')['A'].transform('mean') <= 3)
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 True
9 True
Name: A, dtype: bool
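The question also asks how many unique IDs were removed and how to save the result; a short sketch, assuming df1 from above ('output.csv' is a placeholder path):

removed = df['ID'].nunique() - df1['ID'].nunique()
print(removed)   # number of unique IDs dropped by the filter

df1.to_csv('output.csv', index=False)   # save in the original, ungrouped format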
Another solution uses groupby and filter. This solution is slower than using transform with boolean indexing.
df.groupby('ID').filter(lambda x: x['A'].mean() <= 3)
Output:
ID A
0 1 2
1 1 3
2 1 1
6 3 6
7 3 1
8 3 1
9 3 1
I have a pandas DataFrame with one column which is either true or false (titled 'condition' in the example below). I would like to group the DataFrame by consecutive true or false values. I have tried pandas.groupby but haven't succeeded with that method, albeit I think that's down to my lack of understanding. An example of the dataframe can be found below:
df = pd.DataFrame(df)
print df
index condition H t
0 1 2 1.1
1 1 7 1.5
2 0 1 0.9
3 0 6.5 1.6
4 1 7 1.1
5 1 9 1.8
6 1 22 2.0
Ideally the output of the program would be something along the lines of what can be found below. I was thinking of using some sort of 'grouping' method to make it easier to call each set of results, but I'm not sure if this is the best method. Any help would be greatly appreciated.
index condition H t group
0 1 2 1.1 1
1 1 7 1.5 1
2 0 1 0.9 2
3 0 6.5 1.6 2
4 1 7 1.1 3
5 1 9 1.8 3
6 1 22 2.0 3
Since you're dealing with 0/1s, here's an alternative using diff + cumsum -
df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
df
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
If you don't mind floats, this can be made a little faster.
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
df
index condition H t group
0 0 1 2.0 1.1 1.0
1 1 1 7.0 1.5 1.0
2 2 0 1.0 0.9 2.0
3 3 0 6.5 1.6 2.0
4 4 1 7.0 1.1 3.0
5 5 1 9.0 1.8 3.0
6 6 1 22.0 2.0 3.0
Here's the version with numpy equivalents -
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
df
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
On my machine, here are the timings -
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 25.1 ms per loop
%%timeit
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
10 loops, best of 3: 23.4 ms per loop
%%timeit
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
10 loops, best of 3: 21.4 ms per loop
%timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 15.8 ms per loop
Compare with ne (!=) against the shifted column, then use cumsum:
df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
print (df)
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
Detail:
print (df['condition'].ne(df['condition'].shift()))
index
0 True
1 False
2 True
3 False
4 True
5 False
6 False
Name: condition, dtype: bool
Timings:
df = pd.concat([df]*100000).reset_index(drop=True)
In [54]: %timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 12.2 ms per loop
In [55]: %timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 24.5 ms per loop
In [56]: %%timeit
...: df['group'] = 1
...: df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
...:
10 loops, best of 3: 26.6 ms per loop
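With the group column in place, a small usage sketch for working with each consecutive run (column names follow the example above):

for key, run in df.groupby('group'):
    # each `run` is one block of consecutive identical `condition` values
    print(key, run['condition'].iat[0], len(run))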
I would like to filter and replace. For the values which are lower or higher than zero and not NaN, I would like to set them to one, and set the others to zero.
mask = (ts[x] > 0) | (ts[x] < 0)
ts[mask] = 1
ts[ts[x] == 1]
I did this and it works, but I still have to deal with the values that do not meet this condition by replacing them with zero.
Any recommendations? I am quite confused; also, would it be better to use the where function in this case?
Thanks all!
Sample Data
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
Expected result
asset.relativeSetpoint.350
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
You can do this by applying a logical AND on the two conditions and converting the resultant mask to integer.
df
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
(df['asset.relativeSetpoint.350'].ne(0)
& df['asset.relativeSetpoint.350'].notnull()).astype(int)
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
Name: asset.relativeSetpoint.350, dtype: int64
The first condition df['asset.relativeSetpoint.350'].ne(0) gets a boolean mask of all elements that are not equal to 0 (this would include <0, >0, and NaN).
The second condition df['asset.relativeSetpoint.350'].notnull() will get a boolean mask of elements that are not NaNs.
The two masks are ANDed, and converted to integer.
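Since the question asks about where, here is a minimal equivalent sketch with np.where (an assumption, not part of the answer above); it produces the same 0/1 column:

import numpy as np

col = df['asset.relativeSetpoint.350']
# nonzero and not NaN -> 1, everything else (zero or NaN) -> 0
df['asset.relativeSetpoint.350'] = np.where(col.ne(0) & col.notnull(), 1, 0)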
How about using apply?
# pd.notna guards the NaNs, since NaN != 0 evaluates to True
df[COLUMN_NAME] = df[COLUMN_NAME].apply(lambda x: 1 if x != 0 and pd.notna(x) else 0)
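Note that apply evaluates the lambda row by row, so on large frames the vectorized mask above will usually be faster.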
In Pandas, I'm trying to figure out how to generate a column that is the difference between the time of the current row and the time of the last row in which the value of another column is True:
So given the dataframe:
df = pd.DataFrame({'Time':[5,10,15,20,25,30,35,40,45,50],
'Event_Occured': [True,False,False,True,True,False,False,True,False,False]})
print df
Event_Occured Time
0 True 5
1 False 10
2 False 15
3 True 20
4 True 25
5 False 30
6 False 35
7 True 40
8 False 45
9 False 50
I'm trying to generate a column that would look like this:
Event_Occured Time Time_since_last
0 True 5 0
1 False 10 5
2 False 15 10
3 True 20 0
4 True 25 0
5 False 30 5
6 False 35 10
7 True 40 0
8 False 45 5
9 False 50 10
Thanks very much!
Using df.Event_Occured.cumsum() gives you distinct groups to groupby. Then applying a function per group that subtracts the first member's value from every member gets you what you want.
df['Time_since_last'] = \
df.groupby(df.Event_Occured.cumsum()).Time.apply(lambda x: x - x.iloc[0])
df
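An equivalent sketch using transform instead of apply, which avoids the Python-level lambda; it relies on the same cumsum grouping:

# first Time of each group, broadcast back to every row of that group
first_time = df.groupby(df['Event_Occured'].cumsum())['Time'].transform('first')
df['Time_since_last'] = df['Time'] - first_time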
Here's an alternative that fills the values corresponding to Falses with the last valid observation:
df['Time'] - df.loc[df['Event_Occured'], 'Time'].reindex(df.index).ffill()
Out:
0 0.0
1 5.0
2 10.0
3 0.0
4 0.0
5 5.0
6 10.0
7 0.0
8 5.0
9 10.0
Name: Time, dtype: float64
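If integer output is preferred over the float result above, a small variant; it assumes the first row is True, so the ffill leaves no NaN behind:

last_true = df.loc[df['Event_Occured'], 'Time'].reindex(df.index).ffill()
df['Time_since_last'] = (df['Time'] - last_true).astype(int)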