I have a dataframe that looks like this:
Answers all_answers Score
0 0.0 0 72
1 0.0 0 73
2 0.0 0 74
3 1.0 1 1
4 -1.0 1 2
5 1.0 1 3
6 -1.0 1 4
7 1.0 1 5
8 0.0 0 1
9 0.0 0 2
10 -1.0 1 1
11 0.0 0 1
12 0.0 0 2
13 1.0 1 1
14 0.0 0 1
15 0.0 0 2
16 1.0 1 1
The first column is a signal whose sign flips in the calculation flow.
The second column is the first one with the minus sign removed.
The third column is a running counter for the second column: how many consecutive ones or zeros have been seen so far.
I want to add a fourth column that keeps only the ones that occur in a run of, for example, at least 5 in a row, while preserving the sign from the first column.
To get something like this
Answers all_answers Score New
0 0.0 0 72 0
1 0.0 0 73 0
2 0.0 0 74 0
3 1.0 1 1 1
4 -1.0 1 2 -1
5 1.0 1 3 1
6 -1.0 1 4 -1
7 1.0 1 5 1
8 0.0 0 1 0
9 0.0 0 2 0
10 -1.0 1 1 0
11 0.0 0 1 0
12 0.0 0 2 0
13 1.0 1 1 0
14 0.0 0 1 0
15 0.0 0 2 0
16 1.0 1 1 0
17 0.0 0 1 0
Is it possible to do this with Pandas?
You can use:
# group by consecutive 0/1
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
# get size of each group and compare to threshold
m = df.groupby(g)['all_answers'].transform('size').ge(5)
# mask small groups
df['New'] = df['Answers'].where(m, 0)
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
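To see why g identifies consecutive runs: ne/shift flags every element that differs from its predecessor (the start of a run), and cumsum turns those flags into run ids, so all members of one run share the same id. A minimal sketch on a toy series:
import pandas as pd

s = pd.Series([0, 0, 1, 1, 1, 0, 1])
g = s.ne(s.shift()).cumsum()            # run ids: 1 1 2 2 2 3 4
sizes = s.groupby(g).transform('size')  # run lengths: 2 2 3 3 3 1 1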
A faster way (with regex):
import pandas as pd
import re

def repl5(m):
    # replace each run of >= 5 ones with the marker digit '5'
    return '5' * len(m.group())

s = df['all_answers'].astype(str).str.cat()  # e.g. '00011111001...'
d = re.sub('(?:1{5,})', repl5, s)            # mark qualifying runs
d = [x == '5' for x in d]                    # per-row boolean mask
df['New'] = df['Answers'].where(d, 0.0)
df
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
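If the run-length threshold needs to vary, the marker-digit trick can be generalized by building the boolean mask directly from the regex match spans. A sketch, with a hypothetical helper name run_mask (not part of the answer above):
import re
import numpy as np

def run_mask(values, n):
    # mark every position inside a run of '1' of length >= n
    s = ''.join(values.astype(int).astype(str))  # e.g. '0001111100...'
    mask = np.zeros(len(s), dtype=bool)
    for m in re.finditer('1{%d,}' % n, s):
        mask[m.start():m.end()] = True
    return mask

# usage, same shape as above:
# df['New'] = df['Answers'].where(run_mask(df['all_answers'], 5), 0.0)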
Here is the raw data:
date name score
0 2021-01-02 A 100
1 2021-01-03 A 120
2 2021-01-04 A 130
3 2021-01-05 A 115
4 2021-01-06 A 120
5 2021-01-07 A 70
6 2021-01-08 A 60
7 2021-01-09 A 30
8 2021-01-10 A 10
9 2021-01-11 A 100
10 2021-01-02 B 50
11 2021-01-03 B 40
12 2021-01-04 B 80
13 2021-01-05 B 115
14 2021-01-06 B 100
15 2021-01-07 B 50
16 2021-01-08 B 20
17 2021-01-09 B 40
18 2021-01-10 B 120
19 2021-01-11 B 20
20 2021-01-02 C 80
21 2021-01-03 C 100
22 2021-01-04 C 120
23 2021-01-05 C 115
24 2021-01-06 C 90
25 2021-01-07 C 80
26 2021-01-08 C 150
27 2021-01-09 C 200
28 2021-01-10 C 30
29 2021-01-11 C 40
I would like to get the following output, with a new column calculating the trailing 3-day average for each name. Besides, I would love to add some new columns with logical calculations like df.score.shift(1) <= 100.
date name score 3_day_average previous_score<=100
0 2021-01-02 A 100 NaN False
1 2021-01-03 A 120 NaN True
2 2021-01-04 A 130 116.666667 False
3 2021-01-05 A 115 121.666667 False
4 2021-01-06 A 120 121.666667 False
5 2021-01-07 A 70 101.666667 False
6 2021-01-08 A 60 83.333333 True
7 2021-01-09 A 30 53.333333 True
8 2021-01-10 A 10 33.333333 True
9 2021-01-11 A 100 46.666667 True
10 2021-01-02 B 50 NaN False
11 2021-01-03 B 40 NaN True
12 2021-01-04 B 80 56.666667 True
13 2021-01-05 B 115 78.333333 True
14 2021-01-06 B 100 98.333333 False
15 2021-01-07 B 50 88.333333 True
16 2021-01-08 B 20 56.666667 True
17 2021-01-09 B 40 36.666667 True
18 2021-01-10 B 120 60.000000 True
19 2021-01-11 B 20 60.000000 False
20 2021-01-02 C 80 NaN False
21 2021-01-03 C 100 NaN True
22 2021-01-04 C 120 100.000000 True
23 2021-01-05 C 115 111.666667 False
24 2021-01-06 C 90 108.333333 False
25 2021-01-07 C 80 95.000000 True
26 2021-01-08 C 150 106.666667 True
27 2021-01-09 C 200 143.333333 False
28 2021-01-10 C 30 126.666667 False
29 2021-01-11 C 40 90.000000 True
I'm currently using df.groupby('name') and applying a function with df.apply. How could I improve the execution time with an alternative? Thanks in advance!
Use GroupBy.rolling first and then DataFrameGroupBy.shift:
df['3_day_average'] = (df.groupby('name')['score']
.rolling(3)
.mean()
.reset_index(level=0, drop=True))
df['previous_score<=100'] = df.groupby('name')['score'].shift() <= 100
print(df.head(15))
date name score 3_day_average previous_score<=100
0 2021-01-02 A 100 NaN False
1 2021-01-03 A 120 NaN True
2 2021-01-04 A 130 116.666667 False
3 2021-01-05 A 115 121.666667 False
4 2021-01-06 A 120 121.666667 False
5 2021-01-07 A 70 101.666667 False
6 2021-01-08 A 60 83.333333 True
7 2021-01-09 A 30 53.333333 True
8 2021-01-10 A 10 33.333333 True
9 2021-01-11 A 100 46.666667 True
10 2021-01-02 B 50 NaN False
11 2021-01-03 B 40 NaN True
12 2021-01-04 B 80 56.666667 True
13 2021-01-05 B 115 78.333333 True
14 2021-01-06 B 100 98.333333 False
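Why the reset_index(level=0, drop=True) is needed: GroupBy.rolling returns a Series indexed by both the group key and the original row index, and the extra level has to be dropped before the result aligns with df. A small sketch:
# the intermediate result is indexed by (name, original row index)
tmp = df.groupby('name')['score'].rolling(3).mean()
print(tmp.index[:3].tolist())   # [('A', 0), ('A', 1), ('A', 2)]

# dropping the group level restores the original index, so the values
# align row-for-row when assigned back to df
df['3_day_average'] = tmp.reset_index(level=0, drop=True)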
data=[(0 ,'2021-01-02','A',100),
(1 ,'2021-01-03','A',120),
(2 ,'2021-01-04','A',130),
(3 ,'2021-01-05','A',115),
(4 ,'2021-01-06','A',120),
(5 ,'2021-01-07','A', 70),
(6 ,'2021-01-08','A', 60),
(7 ,'2021-01-09','A', 30),
(8 ,'2021-01-10','A', 10),
(9 ,'2021-01-11','A',100),
(10 ,'2021-01-02','B', 50),
(11 ,'2021-01-03','B', 40),
(12 ,'2021-01-04','B', 80),
(13 ,'2021-01-05','B',115),
(14 ,'2021-01-06','B',100),
(15 ,'2021-01-07','B', 50),
(16 ,'2021-01-08','B', 20),
(17 ,'2021-01-09','B', 40),
(18 ,'2021-01-10','B',120),
(19 ,'2021-01-11','B', 20),
(20 ,'2021-01-02','C', 80),
(21 ,'2021-01-03','C',100),
(22 ,'2021-01-04','C',120),
(23 ,'2021-01-05','C',115),
(24 ,'2021-01-06','C', 90),
(25 ,'2021-01-07','C', 80),
(26 ,'2021-01-08','C',150),
(27 ,'2021-01-09','C',200),
(28 ,'2021-01-10','C', 30),
(29 ,'2021-01-11','C', 40)]
header=['id','date','name','score']
df=pd.DataFrame(data,columns=header)
df['3d_rolling_avg'] = df['score'].rolling(window=3).mean()
df['shift'] = df['score'].shift(1)                    # previous row's score
df['prev_score_lessthan_100'] = df['shift'].le(100)   # NaN compares as False
print(df)
output:
id date name score 3d_rolling_avg shift prev_score_lessthan_100
0 0 2021-01-02 A 100 NaN NaN False
1 1 2021-01-03 A 120 NaN 100.0 True
2 2 2021-01-04 A 130 116.666667 120.0 False
3 3 2021-01-05 A 115 121.666667 130.0 False
4 4 2021-01-06 A 120 121.666667 115.0 False
5 5 2021-01-07 A 70 101.666667 120.0 False
6 6 2021-01-08 A 60 83.333333 70.0 True
7 7 2021-01-09 A 30 53.333333 60.0 True
8 8 2021-01-10 A 10 33.333333 30.0 True
9 9 2021-01-11 A 100 46.666667 10.0 True
10 10 2021-01-02 B 50 53.333333 100.0 True
11 11 2021-01-03 B 40 63.333333 50.0 True
12 12 2021-01-04 B 80 56.666667 40.0 True
13 13 2021-01-05 B 115 78.333333 80.0 True
14 14 2021-01-06 B 100 98.333333 115.0 False
15 15 2021-01-07 B 50 88.333333 100.0 True
16 16 2021-01-08 B 20 56.666667 50.0 True
17 17 2021-01-09 B 40 36.666667 20.0 True
18 18 2021-01-10 B 120 60.000000 40.0 True
19 19 2021-01-11 B 20 60.000000 120.0 False
20 20 2021-01-02 C 80 73.333333 20.0 True
21 21 2021-01-03 C 100 66.666667 80.0 True
22 22 2021-01-04 C 120 100.000000 100.0 True
23 23 2021-01-05 C 115 111.666667 120.0 False
24 24 2021-01-06 C 90 108.333333 115.0 False
25 25 2021-01-07 C 80 95.000000 90.0 True
26 26 2021-01-08 C 150 106.666667 80.0 True
27 27 2021-01-09 C 200 143.333333 150.0 False
28 28 2021-01-10 C 30 126.666667 200.0 False
29 29 2021-01-11 C 40 90.000000 30.0 True
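Note that this version rolls and shifts over the whole frame, so windows and shifts cross name boundaries (e.g. rows 10-11 of B average in values from A). If per-name boundaries are wanted, the grouped variant from the previous answer carries over; a minimal sketch:
df['3d_rolling_avg'] = (df.groupby('name')['score']
                          .rolling(3).mean()
                          .reset_index(level=0, drop=True))
df['shift'] = df.groupby('name')['score'].shift(1)
df['prev_score_lessthan_100'] = df['shift'].le(100)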
I am trying to find when a price value crosses above a high. I can find the high, but when I compare it to the current price it gives me all 1s.
my code :
peak = df[(df['price'] > df['price'].shift(-1)) & (df['price'] > df['price'].shift(1))]
df['peak'] = peak
df['breakout'] = df['price'] > df['peak']
print(df)
out:
    price   peak  breakout
1       2    NaN         1
2       2    NaN         1
3       4    NaN         1
4       5    NaN         1
5       6    6.0         1
6       5    NaN         1
7       4    NaN         1
8       3    NaN         1
9      12   12.0         1
10     10    NaN         1
11     50    NaN         1
12    100    NaN         1
13    110  110.0         1
14     84    NaN         1
expect:
    price   peak  high  breakout
1       2    NaN     0         0
2       2    NaN     0         0
3       4    NaN     0         0
4       5    NaN     0         0
5       6    6.0     1         1
6       5    NaN     0         0
7       4    NaN     0         0
8       3    NaN     0         0
9      12   12.0     1         1
10     10    NaN     0         0
11     50    NaN     0         1
12    100    NaN     0         1
13    110  110.0     1         1
14     84    NaN     0         0
with fillna:
price peak look breakout
0 2 NaN NaN False
1 4 NaN NaN False
2 5 NaN NaN False
3 6 6.0 6.0 False
4 5 NaN 6.0 False
5 4 NaN 6.0 False
6 3 NaN 6.0 False
7 12 12.0 12.0 False ----> this should be True because it is higher than 6 and it is also the high for shift(-1) and shift(1)
8 10 NaN 12.0 False
9 50 NaN 12.0 True
10 100 100.0 100.0 False
11 40 NaN 100.0 False
12 45 45.0 45.0 False
13 30 NaN 45.0 False
14 200 NaN 45.0 True
Try with pandas.DataFrame.fillna:
df["breakout"] = df["price"] >= df["peak"].fillna(method = "ffill")
If you want it with 1s and 0s, add the line:
df["breakout"] = df["breakout"].replace([True, False],[1,0])
Note that df["peak"].fillna(method = "ffill") returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 6.0
5 6.0
6 6.0
7 6.0
8 12.0
9 12.0
10 12.0
11 12.0
12 110.0
13 110.0
Name: peak, dtype: float64
So you can compare it easily with the price column.
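Putting it together on the data from the question, a minimal end-to-end sketch (assuming a plain price column, reconstructed here from the expected table):
import pandas as pd

df = pd.DataFrame({'price': [2, 2, 4, 5, 6, 5, 4, 3, 12, 10, 50, 100, 110, 84]})

# local peaks: strictly higher than both neighbours
is_peak = (df['price'] > df['price'].shift(-1)) & (df['price'] > df['price'].shift(1))
df['peak'] = df['price'].where(is_peak)   # peak value or NaN
df['high'] = is_peak.astype(int)          # 0/1 peak marker

# breakout: price at or above the most recent peak so far
df['breakout'] = (df['price'] >= df['peak'].ffill()).astype(int)
This reproduces the expected table: the early rows stay 0 while no peak exists yet, and the rows with prices 50 and 100 become 1 because they exceed the last peak of 12.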
I have a dataframe with a date+time and a label, which I want to reshape into date (/month) columns with label frequencies for that month:
date_time label
1 2017-09-26 17:08:00 0
3 2017-10-03 13:27:00 2
4 2017-10-04 19:04:00 0
11 2017-10-11 18:28:00 1
27 2017-10-13 11:22:00 0
28 2017-10-13 21:43:00 0
39 2017-10-16 14:43:00 0
40 2017-10-16 21:39:00 0
65 2017-10-21 21:53:00 2
...
98 2017-11-01 20:08:00 3
99 2017-11-02 12:00:00 3
100 2017-11-02 12:01:00 2
109 2017-11-02 12:03:00 3
110 2017-11-03 22:24:00 0
111 2017-11-04 09:05:00 3
112 2017-11-06 12:36:00 3
113 2017-11-06 12:48:00 2
128 2017-11-07 15:20:00 2
143 2017-11-10 16:36:00 3
144 2017-11-10 20:00:00 0
145 2017-11-10 20:02:00 0
I group the label frequency by month with this line (thanks partially to this post):
df2 = df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count()
which outputs
date_time label
2017-09-30 0 1
2017-10-31 0 6
1 1
2 8
3 2
2017-11-30 0 25
4 2
5 1
2 4
3 11
2017-12-31 0 14
5 3
2 5
3 7
2018-01-31 0 8
4 1
5 1
2 2
3 3
but, as mentioned before, I would like to get the data by month/date columns:
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
Currently I can sort of split the data by month with
pd.concat([df2[m] for m in df2.index.levels[0]], axis=1).fillna(0)
but I lose the column names:
label label label label label
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
So I have to do a longer version where I generate a series, rename it, concatenate and then fill in the blanks:
m_list = []
for m in df2.index.levels[0]:
    m_labels = df2[m]
    m_labels = m_labels.rename(m)
    m_list.append(m_labels)
pd.concat(m_list, axis=1).fillna(0)
resulting in
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
Is there a shorter/more elegant way to get to this last dataframe from my original one?
You just need unstack here:
df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count().unstack(0,fill_value=0)
Out[235]:
date_time 2017-09-30 2017-10-31 2017-11-30
label
0 1 5 3
1 0 1 0
2 0 2 3
3 0 0 6
Based on your groupby output:
s.unstack(0,fill_value=0)
Out[240]:
date_time 2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
label
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
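If month-end timestamps in the column labels aren't essential, an alternative (a sketch, assuming date_time is already a datetime column) is pd.crosstab with monthly periods:
out = pd.crosstab(df['label'], df['date_time'].dt.to_period('M'))
crosstab fills missing label/month combinations with 0 by default; the columns are monthly Periods such as 2017-09 rather than month-end dates.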
I have the following df of values for various slices across time:
date A B C
0 2016-01-01 5 7 2
1 2016-01-02 6 12 15
...
2 2016-01-08 9 5 16
...
3 2016-12-24 5 11 13
4 2016-12-31 3 52 22
I would like to create a new dataframe that calculates the week-over-week change in each slice, by date. For example, I want the new table to be blank for all slices from Jan 1 to Jan 7. I want the value for Jan 8 to be the Jan 8 value for the given slice minus the Jan 1 value of that slice, the value for Jan 9 to be the Jan 9 value minus the Jan 2 value, and so forth, all the way down.
The example table would look like this:
date A B C
0 2016-01-01 0 0 0
1 2016-01-02 0 0 0
...
2 2016-01-08 4 -2 14
...
3 2016-12-24 4 12 2
4 2016-12-31 -2 41 9
You may assume the offset is ALWAYS 7. In other words, there are no missing dates.
@Unatiel's answer is correct in this case, where there are no missing dates, and should be accepted.
But I wanted to post a modification here for cases with missing dates, for anyone interested. From the docs:
The shift method accepts a freq argument which can accept a DateOffset class or other timedelta-like object or also an offset alias
from pandas.tseries.offsets import Week
res = ((df - df.shift(1, freq=Week()).reindex(df.index))
.fillna(value=0)
.astype(int))
print(res)
A B
date
2016-01-01 0 0
2016-01-02 0 0
2016-01-03 0 0
2016-01-04 0 0
2016-01-05 0 0
2016-01-06 0 0
2016-01-07 0 0
2016-01-08 31 46
2016-01-09 4 20
2016-01-10 -51 -65
2016-01-11 56 5
2016-01-12 -51 24
.. ..
2016-01-20 34 -30
2016-01-21 -28 19
2016-01-22 24 8
2016-01-23 -28 -46
2016-01-24 -11 -60
2016-01-25 -34 -7
2016-01-26 -12 -28
2016-01-27 -41 42
2016-01-28 -2 48
2016-01-29 35 -51
2016-01-30 -8 62
2016-01-31 -6 -9
If we know the offset is always 7, then use shift(). Here is a quick example showing how it works:
import pandas as pd
df = pd.DataFrame({'x': range(30)})
df.shift(7)
x
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 0.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
...
So with this you can do :
df - df.shift(7)
x
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 7.0
8 7.0
...
In your case, don't forget to call set_index('date') first.
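Applied to the frame in the question, that looks roughly like the following sketch (diff(7) is shorthand for df - df.shift(7)):
out = (df.set_index('date')   # leave only the A/B/C value columns
         .diff(7)             # each value minus the value 7 rows earlier
         .fillna(0)           # the first week has no prior week
         .astype(int))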