In a pandas dataframe, I want to filter the rows where some column is stable within 10.0 units.
def abs_delta_fn(window):
    x = window[0]
    for y in window[1:]:
        if abs(x - y) > 10.0:
            return False
    return True

df['filter'] = df['column'].rolling(5, min_periods=5).apply(abs_delta_fn)
So, if I have a df like this
1 0
2 20
3 40
4 40
5 40
6 40
7 40
8 90
9 120
10 120
applying the rolling window I get:
1 0 nan
2 20 nan
3 40 nan
4 40 nan
5 40 False
6 40 False
7 40 True
8 90 False
9 120 False
10 120 False
How can I get this in a smart way?
1 0 nan (or False)
2 20 nan (or False)
3 40 True
4 40 True
5 40 True
6 40 True
7 40 True
8 90 False
9 120 False
10 120 False
IIUC, you already know rolling, so we just add apply after it. The key here is .iloc[::-1]: rolling looks backward from the current row, but you need it to look forward:
s = df.x.iloc[::-1].rolling(5, min_periods=5).apply(lambda x: (abs(x - x[0]) < 10).all())
df.loc[df.index.difference(sum([list(range(x, x + 5)) for x in s[s == 1].index.values], []))]
Out[1119]:
x
1 0
2 20
8 90
9 120
10 120
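If you want the True/False column from the question rather than the filtered rows, here is a minimal sketch along the same reversed-rolling lines (assuming the column is named x and using the question's 10.0-unit test against the window's first value):
# reverse the series so each rolling window effectively looks forward;
# in the reversed window, w.iloc[-1] is the first value of the forward window
rev = df['x'].iloc[::-1]
starts = rev.rolling(5, min_periods=5).apply(
    lambda w: float(((w - w.iloc[-1]).abs() <= 10.0).all()), raw=False
).iloc[::-1]  # 1.0 where a stable 5-row window starts
# a row is stable if a stable window starts at it or at one of the
# 4 rows before it
df['filter'] = (starts == 1).astype(int).rolling(5, min_periods=1).max().astype(bool)
This yields False (rather than NaN) for the first rows, which the question allows.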
Got a pandas DataFrame with the structure below:
Index Name Value Other
1 NaN 10 5
2 A 20 2
3 30 3
4 100 12
5 NaN 40 10
6 C 10 1
7 40 10
8 40 10
9 40 10
10 NaN 40 10
11 D 10 1
12 NaN 40 10
...
I need to copy a value from the Name column in rows that have it to the rows below it, until a NaN or another value is found. So how do I approach copying name A to rows 3 and 4, then C [row 6] to rows 7, 8, 9... until NaN/some other name?
So after running the code I should get a DataFrame like this:
Index Name Value Other
1 NaN 10 5
2 A 20 2
3 A 30 3
4 A 100 12
5 NaN 40 10
6 C 10 1
7 C 40 10
8 C 40 10
9 C 40 10
10 NaN 40 10
11 D 10 1
12 NaN 40 10
Just use replace():
# convert the string 'nan' to an actual NaN
df = df.replace('nan', float('NaN'), regex=True)
# forward-fill only the blank ('') values; the real NaN rows are untouched
df['Name'] = df['Name'].replace('', method='ffill')
output:
Index Name Value Other
1 NaN 10 5
2 A 20 2
3 A 30 3
4 A 100 12
5 NaN 40 10
6 C 10 1
7 C 40 10
8 C 40 10
9 C 40 10
10 NaN 40 10
11 D 10 1
12 NaN 40 10
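Note that the method= parameter of replace() is deprecated in recent pandas (2.1+). A minimal sketch of the same idea for newer versions, assuming (as in the example) that blanks only appear directly under a name:
import numpy as np

# remember which rows were genuinely missing, treat blanks as missing,
# forward-fill, then restore the original NaNs
was_nan = df['Name'].isna()
df['Name'] = df['Name'].replace('', np.nan).ffill()
df.loc[was_nan, 'Name'] = np.nan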
I have two dataframes:
a = pd.DataFrame({'id': [10, 20, 30, 40, 50, 60, 70]})
b = pd.DataFrame({'id': [10, 30, 40, 70]})
print(a)
print(b)
# a
id
0 10
1 20
2 30
3 40
4 50
5 60
6 70
# b
id
0 10
1 30
2 40
3 70
I am trying to have an extra column in a if id is present on b like so:
# a
id present
0 10 True
1 20 False
2 30 True
3 40 True
4 50 False
5 60 False
6 70 True
What I've tried:
a.join(b,rsuffix='a')
# and then thought I'd replace nans with False and values with True
# but it does not return what I expect as it joined row by row
id ida
0 10 10.000
1 20 30.000
2 30 40.000
3 40 70.000
4 50 nan
5 60 nan
6 70 nan
Then I added:
a.join(b,rsuffix='a', on='id')
But I did not get what I expected either:
id ida
0 10 nan
1 20 nan
2 30 nan
3 40 nan
4 50 nan
5 60 nan
6 70 nan
I also tried a['present'] = b['id'].isin(a['id']) but that did not return what I expected:
id present
0 10 True
1 20 True
2 30 True
3 40 True
4 50 NaN
5 60 NaN
6 70 NaN
How can I have an extra column in a denoting if id is present in b with True / False statements?
You are close; you need to test a['id'] against b['id'] with Series.isin:
a['present'] = a['id'].isin(b['id'])
print (a)
id present
0 10 True
1 20 False
2 30 True
3 40 True
4 50 False
5 60 False
6 70 True
With merge it is possible to use the parameter indicator=True in a left join and test the _merge column for 'both':
a['present'] = a.merge(b, on='id', how='left', indicator=True)['_merge'].eq('both')
print (a)
id present
0 10 True
1 20 False
2 30 True
3 40 True
4 50 False
5 60 False
6 70 True
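One caveat (an assumption on my part, not from the original answer): if b can contain duplicate ids, the left join multiplies rows and the assignment above misaligns. Deduplicating b first avoids this:
# guard against duplicate ids in b before the indicator merge
a['present'] = a.merge(b.drop_duplicates('id'), on='id', how='left', indicator=True)['_merge'].eq('both')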
I have a pandas column and I want to sum it, carrying the previous value forward whenever a 0 is encountered. It will be clearer with this example:
ds = pd.DataFrame([0,1,2,3,4,50,0,1,3,5,55,0,5], columns = ['a'])
print(ds)
a
0 0
1 1
2 2
3 3
4 4
5 50
6 0
7 1
8 3
9 5
10 55
11 0
12 5
Output should be -
a
0 0
1 1
2 2
3 3
4 4
5 50
6 50
7 51
8 53
9 55
10 105
11 105
12 110
Try with shift, then where to keep the previous value only at the rows where a is 0 (masking everything else to NaN), then cumsum and ffill; since you need the previous value as the fill, add it back to the original column:
ds.a = ds.a.add(ds.a.shift().where(ds.a.eq(0)).cumsum().ffill(), fill_value=0)
Out[132]:
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 50.0
6 50.0
7 51.0
8 53.0
9 55.0
10 105.0
11 105.0
12 110.0
Name: a, dtype: float64
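For readability, the same pipeline can be broken into named steps (a sketch over the example frame ds):
prev = ds['a'].shift()               # each row's previous value
carry = prev.where(ds['a'].eq(0))    # keep it only where a == 0, NaN elsewhere
offset = carry.cumsum().ffill()      # accumulate the carries and propagate forward
ds['a'] = ds['a'].add(offset, fill_value=0)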
One can exploit .diff(-1) to obtain the streak-ending locations (50 and 55) directly:
First, put .diff(-1) into .where() to retain the streak-ending elements while filling other elements with 0.
Second, perform cumsum(), shift the result forward by 1 with fill_value=0, and add this to the original data.
Code:
ds["a"] += ds["a"].where(ds["a"].diff(-1) > 0, other=0).cumsum().shift(fill_value=0)
Result:
print(ds)
a
0 0
1 1
2 2
3 3
4 4
5 50
6 50
7 51
8 53
9 55
10 105
11 105
12 110
I have a huge pandas DataFrame that looks like this:
id type price min max
1 ch 10 10 100
1 fo 8 20 100
1 dr 7 10 90
1 ad 5 16 20
1 dr 6 10 90
1 fo 4 20 100
2 ch 5 40 50
2 fo 3 10 50
2 ch 3 40 50
... ... ... ... ...
I would like to add a new column 'match' to get something like this:
id type price min max match
1 ch 10 10 100 false
1 fo 8 20 100 false
1 dr 7 10 90 false
1 ad 5 16 20 false
1 dr 6 10 90 true
1 fo 4 20 100 true
2 ch 5 40 50 false
2 fo 3 10 50 false
2 ch 3 40 50 true
... ... ... ... ... ...
I tried using shift:
df['match'] = np.where((df['id'] == df['id'].shift()) & (df['type'] == df['type'].shift()) & (df['min'] == df['min'].shift()) & (df['max'] == df['max'].shift()), True, False)
but that just compares the current row with the previous one. There is no specific pattern to determine the number of previous rows that match the condition. I would like to use the id as a window to compare rows. Is there a way to do that?
Any suggestions are highly appreciated.
Thank you
You could use duplicated specifying the subset of columns to consider:
df.assign(match=df.duplicated(subset=['id', 'type', 'min', 'max']))
id type price min max match
0 1 ch 10 10 100 False
1 1 fo 8 20 100 False
2 1 dr 7 10 90 False
3 1 ad 5 16 20 False
4 1 dr 6 10 90 True
5 1 fo 4 20 100 True
6 2 ch 5 40 50 False
7 2 fo 3 10 50 False
8 2 ch 3 40 50 True
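As a side note, if you also want the first occurrence in each duplicated group marked True, duplicated accepts keep=False:
# mark every member of a duplicated group, including the first occurrence
df.assign(match=df.duplicated(subset=['id', 'type', 'min', 'max'], keep=False))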
Best described by an example. Input is
ts val
0 10 False
1 20 True
2 20 False
3 30 True
4 40 False
5 40 False
6 40 False
7 60 True
8 60 False
desired output is
ts val
0 10 False
1 20 True
2 20 True
3 30 True
4 40 False
5 40 False
6 40 False
7 60 True
8 60 True
The idea is as follows: if we see at least one True value inside a ts cluster (i.e. rows with the same ts value), make all other values with the exact same timestamp True.
You can use groupby on column 'ts' and then transform with any() to determine whether any val in the cluster/group is True.
import pandas as pd
# your data
# =====================
print(df)
Out[58]:
ts val data
0 10 False 0.3332
1 20 True -0.6877
2 20 False -0.6004
3 30 True 0.1922
4 40 False 0.2472
5 40 False -0.0117
6 40 False 0.8607
7 60 True -1.1464
8 60 False 0.0698
# processing
# =====================
# as suggested by @DSM, transform is the best way to do it
df['val'] = df.groupby('ts')['val'].transform(any)
Out[61]:
ts val data
0 10 False 0.3332
1 20 True -0.6877
2 20 True -0.6004
3 30 True 0.1922
4 40 False 0.2472
5 40 False -0.0117
6 40 False 0.8607
7 60 True -1.1464
8 60 True 0.0698
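Equivalently, you can pass the string alias so pandas can dispatch to its optimized implementation:
df['val'] = df.groupby('ts')['val'].transform('any')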