Pandas dataframe: propagate True values if timestamp is identical - python

Best described by an example. Input is
ts val
0 10 False
1 20 True
2 20 False
3 30 True
4 40 False
5 40 False
6 40 False
7 60 True
8 60 False
desired output is
ts val
0 10 False
1 20 True
2 20 True
3 30 True
4 40 False
5 40 False
6 40 False
7 60 True
8 60 True
The idea is as follows: if at least one True value appears within a ts cluster (i.e. rows with the same ts value), make val True for every row that has that exact same timestamp.

You can group by column 'ts' and then use transform with any() to determine whether any val in the cluster/group is True and broadcast that result back to every row.
import pandas as pd
# your data
# =====================
print(df)
Out[58]:
ts val data
0 10 False 0.3332
1 20 True -0.6877
2 20 False -0.6004
3 30 True 0.1922
4 40 False 0.2472
5 40 False -0.0117
6 40 False 0.8607
7 60 True -1.1464
8 60 False 0.0698
# processing
# =====================
# as suggested by @DSM, transform is the best way to do it
df['val'] = df.groupby('ts')['val'].transform(any)
df
Out[61]:
ts val data
0 10 False 0.3332
1 20 True -0.6877
2 20 True -0.6004
3 30 True 0.1922
4 40 False 0.2472
5 40 False -0.0117
6 40 False 0.8607
7 60 True -1.1464
8 60 True 0.0698
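For reference, here is a minimal self-contained sketch of the same idea, built only from the ts and val columns shown in the question:

import pandas as pd

df = pd.DataFrame({'ts':  [10, 20, 20, 30, 40, 40, 40, 60, 60],
                   'val': [False, True, False, True, False, False, False, True, False]})

# broadcast "does this ts group contain any True?" back to every row of the group
df['val'] = df.groupby('ts')['val'].transform('any')
print(df)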

Related

Find a series in dataframe and replace it with original row

I have the dataframe df below, but some rows with D4 == True were causing an issue in my custom ordering. As a workaround, I stored such rows in a list, intentionally turned those D4 values to False, and sorted with my custom ordering.
Index D1 D2 D3 D4 D5
0 8 5 0 False True
1 45 35 0 True False
2 35 10 1 False True
3 40 5 0 True False
4 12 10 5 False False
5 18 15 13 False True
6 25 15 5 True False
7 35 10 11 False True
8 95 50 0 False False
hacked_rows = []

def hack_d4(row):
    if row['D3'] in [0, 1]:
        row['D4'] = False
        hacked_rows.append(row)
    return row

df_hacked = df.apply(lambda x: hack_d4(x), axis=1)
ordered_df = order_df(df_hacked)  # Returns same df with some rows in custom order.
So, in short, I have to revert the ordered_df below to the original df with the help of the hacked_rows list. Row order is not important; only the hacked rows need to be replaced back in the original dataset.
Index D1 D2 D3 D4 D5
0 0 8 5 0 False True
2 2 35 10 1 False True
3 3 40 5 0 False False
1 1 45 35 0 False False
5 5 18 15 13 False True
4 4 12 10 5 False False
7 7 35 10 11 False True
8 8 95 50 0 False False
6 6 25 15 5 True False
Now that I am done with the custom ordering, I want to put the rows stored in hacked_rows back into the dataframe, but I am not sure how to replace them.
I tried the code below for one row, but no luck; it throws a TypeError:
item = hacked_rows[0]
item = item.drop('D3')
ordered_df.loc[item]  # but this line throws the error
Note: I am also okay with a different approach to temporarily flip the True values, if anyone can suggest one.
I think the error occurs when you recreate the data frame.
hacked_rows = []

def hack_d4(row):
    if row['D3'] in [0, 1]:
        row['D4'] = False
        hacked_rows.append(row)
    return row

df = df.apply(lambda x: hack_d4(x), axis=1)
ordered_df = pd.DataFrame(df)  # code update
df
Index D1 D2 D3 D4 D5
0 0 8 5 0 False True
1 1 45 35 0 False False
2 2 35 10 1 False True
3 3 40 5 0 False False
4 4 12 10 5 False False
5 5 18 15 13 False True
6 6 25 15 5 True False
7 7 35 10 11 False True
8 8 95 50 0 False False
Update:
Based on your comment, I added code that converts hacked_rows back into a data frame.
new_df = pd.DataFrame(index=[], columns=[])
for i in hacked_rows:
    new_df = pd.concat([new_df, pd.Series(i)], axis=1, ignore_index=True)
new_df.stack().unstack(level=1).T
Index D1 D2 D3 D4 D5
1 0 8 5 0 False True
2 1 45 35 0 False False
3 2 35 10 1 False True
4 3 40 5 0 False False
5 8 95 50 0 False False
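A different way to handle the temporary flip, as the question invites (a sketch, not part of the answer above): instead of storing whole rows, record the index labels of the rows whose D4 you change, and restore them by label after the custom ordering. This assumes order_df keeps the original index labels, as it appears to in the question.

# remember which rows get their D4 flipped (D3 in [0, 1] and D4 currently True)
flipped = df.index[df['D3'].isin([0, 1]) & df['D4']]
df.loc[flipped, 'D4'] = False

ordered_df = order_df(df)  # custom ordering from the question; index labels preserved

# revert: set D4 back to True for exactly the rows that were flipped
ordered_df.loc[ordered_df.index.intersection(flipped), 'D4'] = True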

Python pandas dataframe backfill based on two conditions

I have a dataframe like this:
Bool Hour
0 False 12
1 False 24
2 False 12
3 False 24
4 True 12
5 False 24
6 False 12
7 False 24
8 False 12
9 False 24
10 False 12
11 True 24
and I would like to backfill each True value in the 'Bool' column back to the point where 'Hour' first reaches 12. The result would be something like this:
Bool Hour Result
0 False 12 False
1 False 24 False
2 False 12 True <- desired backfill
3 False 24 True <- desired backfill
4 True 12 True
5 False 24 False
6 False 12 False
7 False 24 False
8 False 12 False
9 False 24 False
10 False 12 True <- desired backfill
11 True 24 True
Any help is greatly appreciated! Thank you very much!
This is a little bit hard to achieve. Here we can use groupby with idxmax:
s = (~df.Bool & df.Hour.eq(12)).iloc[::-1].groupby(df.Bool.iloc[::-1].cumsum()).transform('idxmax')
df['result'] = df.index >= s.iloc[::-1]
df
Out[375]:
Bool Hour result
0 False 12 False
1 False 24 False
2 False 12 True
3 False 24 True
4 True 12 True
5 False 24 False
6 False 12 False
7 False 24 False
8 False 12 False
9 False 24 False
10 False 12 True
11 True 24 True
IIUC, you can do:
s = df['Bool'].shift(-1)
df['Result'] = df['Bool'] | s.where(s).groupby(df['Hour'].eq(12).cumsum()).bfill()
Output:
Bool Hour Result
0 False 12 False
1 False 24 False
2 False 12 True
3 False 24 True
4 True 12 True
5 False 24 False
6 False 12 False
7 False 24 False
8 False 12 False
9 False 24 False
10 False 12 True
11 True 24 True
Create a group ID s over consecutive runs of Bool, which separates the True rows from the runs of False. Group Hour.eq(12) by s, then use transform('sum') minus cumsum to count how many Hour == 12 rows come after the current row within each group; rows where that count is 0 are candidates, which are AND'ed with s1 (True backfilled from each True row) and OR'd with the original Bool values:
s = df.Bool.ne(df.Bool.shift()).cumsum()
s1 = df.where(df.Bool).Bool.bfill()
g = df.Hour.eq(12).groupby(s)
df['bfill_Bool'] = (g.transform('sum') - g.cumsum()).eq(0) & s1 | df.Bool
Out[905]:
Bool Hour bfill_Bool
0 False 12 False
1 False 24 False
2 False 12 True
3 False 24 True
4 True 12 True
5 False 24 False
6 False 12 False
7 False 24 False
8 False 12 False
9 False 24 False
10 False 12 True
11 True 24 True
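For comparison, here is a plain-loop sketch of the same logic (not one of the answers above): walk backward from each True row and mark rows True until the first preceding Hour == 12 row, inclusive. It assumes the default integer index from the question.

import pandas as pd

df = pd.DataFrame({
    'Bool': [False, False, False, False, True, False,
             False, False, False, False, False, True],
    'Hour': [12, 24, 12, 24, 12, 24, 12, 24, 12, 24, 12, 24],
})

result = df['Bool'].copy()
for i in df.index[df['Bool']]:
    j = i - 1
    while j >= 0:
        result.loc[j] = True          # backfill True
        if df.loc[j, 'Hour'] == 12:   # stop once Hour first reaches 12
            break
        j -= 1
df['Result'] = result
print(df)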

Trying to sort a pandas dataframe by a number column, but getting strange output

I have a dataframe (called df) with a length of 460 that looks like this
index Position T/F
0 1 True
1 2 False
4 3 False
8 4 False
9 18 True
13 5 False
And I would like to sort it by 'Position' so that the whole dataframe looks like this:
index Position T/F
0 1 True
1 2 False
4 3 False
8 4 False
13 5 False
20 6 False
28 7 True
I have attempted to use
df = df.sort_values('Position', ascending=True)
However, that outputs a rather bizarre dataframe with this form
index Position T/F
0 1 True
52 10 False
456 100 False
470 101 False
477 102 False
...
59 11 False
666 110 False
644 111 True
...
1 2 False
You get the idea. I'm not sure why it is sorting like this, but I would like to figure out how to fix it so that I can produce the desired DataFrame.
Position seems to be stored as strings, so the values are being sorted lexicographically. Convert the column to integers first:
df['Position'] = df['Position'].astype(int)
Then do the sorting:
df = df.sort_values('Position', ascending=True)
Output:
index Position T/F
0 1 True
1 2 False
4 3 False
8 4 False
13 5 False
20 6 False
28 7 True
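If you would rather keep Position stored as strings, a sketch using the key argument of sort_values (available since pandas 1.1) sorts numerically without changing the column's dtype:

# sort by the integer value of Position while leaving the column as strings
df = df.sort_values('Position', key=lambda s: s.astype(int))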

Compare string condition during transform operation

My dataframe df:
SCHOOL CLASS GRADE
A Spanish nan
A Spanish nan
A Math 4000
A Math 7830
A Math 3893
B . nan
B . nan
B Biology 1929
B Biology 4839
B Biology 8195
C Spanish nan
C English 2003
C English 1000
C Biology 4839
C Biology 8191
If I do:
school_has_only_two_classes = (
    df.groupby('SCHOOL').CLASS.transform(lambda series: series.nunique()) == 2
)
I get
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 False
11 False
12 False
13 False
14 False
15 False
That transform works fine (school C correctly comes out all False). But if I do:
school_has_spanish = df.groupby('SCHOOL').CLASS.transform(lambda series: series.str.contains('^Spanish$',regex=True))
or
school_has_spanish = df.groupby('SCHOOL').CLASS.transform(lambda series: series=='Spanish')
I get the following result which is not what I was expecting:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 False
12 False
13 False
14 False
15 False
The transform just does not spread the True values to the other rows of the group. The result I was expecting:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 True
13 True
14 True
15 True
Any help is appreciated.
Your lambdas return one boolean per row rather than one value per group, so transform has nothing to broadcast across the group. Reduce with any instead; check any with contains:
df.CLASS.str.contains('Spanish').groupby(df.SCHOOL).transform('any')
Out[230]:
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 True
13 True
14 True
Name: CLASS, dtype: bool
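A sketch keeping the asker's original transform pattern also works, provided the lambda reduces each group to a single value, which transform then broadcasts to every row of that group:

# one boolean per SCHOOL group, broadcast back to all rows of the group
school_has_spanish = df.groupby('SCHOOL')['CLASS'].transform(lambda s: s.eq('Spanish').any())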

Pandas rolling window - Mark values

In a pandas dataframe, I want to filter the rows where some column is stable within 10.0 units.
def abs_delta_fn(window):
    x = window[0]
    for y in window[1:]:
        if abs(x - y) > 10.0:
            return False
    return True

df['filter'] = df['column'].rolling(5, min_periods=5).apply(abs_delta_fn)
So, if I have a df like this:
1 0
2 20
3 40
4 40
5 40
6 40
7 40
8 90
9 120
10 120
applying the rolling window I get:
1 0 nan
2 20 nan
3 40 nan
4 40 nan
5 40 False
6 40 False
7 40 True
8 90 False
9 120 False
10 120 False
How can I get this in a smart way?
1 0 nan (or False)
2 20 nan (or False)
3 40 True
4 40 True
5 40 True
6 40 True
7 40 True
8 90 False
9 120 False
10 120 False
IIUC, you already know rolling; just add apply after it. The key here is .iloc[::-1], because rolling looks backward from the current row, but you need it to look forward:
s = df.x.iloc[::-1].rolling(5, min_periods=5).apply(lambda x: (abs(x - x[0]) < 10).all())
df.loc[df.index.difference(sum([list(range(x, x + 5)) for x in s[s == 1].index.values], []))]
Out[1119]:
x
1 0
2 20
8 90
9 120
10 120
