Following the question I asked: Combine similar rows to one row in python dataframe
I have the original data below, and have two questions to ask:
yyyymmdd hr ariel cat kiki mmax vicky gaolie shiu nick ck
0 2015-12-27 9 0 0 0 0 0 0 0 23 0
1 2015-12-27 10 0 0 0 0 0 0 0 2 0
2 2015-12-27 11 0 0 0 0 0 0 0 20 0
3 2015-12-27 12 0 0 0 0 0 0 0 4 0
4 2015-12-27 17 0 0 0 0 0 0 0 2 0
5 2015-12-27 19 1 0 0 0 0 0 0 0 0
6 2015-12-28 8 0 8 0 0 0 0 0 0 0
7 2015-12-28 9 11 11 0 0 0 0 19 0 0
8 2015-12-28 10 85 13 0 0 2 0 15 0 0
9 2015-12-28 11 2 11 0 0 2 0 14 0 0
10 2015-12-28 12 2 20 0 4 0 0 10 0 0
11 2015-12-28 13 8 9 0 9 3 0 9 0 0
12 2015-12-28 14 4 10 0 8 0 0 22 0 0
13 2015-12-28 15 3 3 0 2 0 0 16 0 0
14 2015-12-28 16 14 5 1 1 0 0 19 0 0
15 2015-12-28 17 15 1 2 0 0 0 19 0 0
16 2015-12-28 18 0 0 0 6 0 0 0 0 0
17 2015-12-28 19 0 0 0 5 0 0 0 0 0
18 2015-12-28 20 0 0 0 1 0 0 0 0 0
How can I "fill" the "hr" index of the DataFrame? The result should be something like this:
yyyymmdd hr ariel cat kiki mmax vicky gaolie shiu nick ck
12/27/15 8 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 9 0 0 0 0 0 0 0 23 0
12/27/15 10 0 0 0 0 0 0 0 2 0
12/27/15 11 0 0 0 0 0 0 0 20 0
12/27/15 12 0 0 0 0 0 0 0 4 0
12/27/15 13 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 14 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 15 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 16 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 17 0 0 0 0 0 0 0 2 0
12/27/15 18 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 19 1 0 0 0 0 0 0 0 0
12/27/15 20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/28/15 8 0 8 0 0 0 0 0 0 0
12/28/15 9 11 11 0 0 0 0 19 0 0
12/28/15 10 85 13 0 0 2 0 15 0 0
12/28/15 11 2 11 0 0 2 0 14 0 0
12/28/15 12 2 20 0 4 0 0 10 0 0
12/28/15 13 8 9 0 9 3 0 9 0 0
12/28/15 14 4 10 0 8 0 0 22 0 0
12/28/15 15 3 3 0 2 0 0 16 0 0
12/28/15 16 14 5 1 1 0 0 19 0 0
12/28/15 17 15 1 2 0 0 0 19 0 0
12/28/15 18 0 0 0 6 0 0 0 0 0
12/28/15 19 0 0 0 5 0 0 0 0 0
12/28/15 20 0 0 0 1 0 0 0 0 0
How can I plot line charts based on the columns and hr?
x-axis = columns, i.e. ariel, cat, kiki...
y-axis = hr, i.e. 8, 9, 10...20
Every subplot represents one date (i.e. 2015-12-27, 2015-12-28...).
Here is the framework of the plot I want (the original post linked to a picture of it).
You can convert yyyymmdd to datetime, combine with the hr information and then resample to hourly frequency like so:
import pandas as pd

df.yyyymmdd = pd.to_datetime(df.yyyymmdd)
df.yyyymmdd = df.apply(lambda x: x.yyyymmdd + pd.DateOffset(hours=x.hr), axis=1)
df.set_index('yyyymmdd', inplace=True)
df = df.resample('H').asfreq()  # modern pandas: resample() returns a Resampler, so materialize the hourly grid with .asfreq()
to get:
hr ariel cat kiki mmax vicky gaolie shiu nick ck
yyyymmdd
2015-12-27 09:00:00 9 0 0 0 0 0 0 0 23 0
2015-12-27 10:00:00 10 0 0 0 0 0 0 0 2 0
2015-12-27 11:00:00 11 0 0 0 0 0 0 0 20 0
2015-12-27 12:00:00 12 0 0 0 0 0 0 0 4 0
2015-12-27 13:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 14:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 15:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 16:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 17:00:00 17 0 0 0 0 0 0 0 2 0
2015-12-27 18:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 19:00:00 19 1 0 0 0 0 0 0 0 0
2015-12-27 20:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 21:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 22:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 23:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 01:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 02:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 03:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 04:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 05:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 06:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 07:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 08:00:00 8 0 8 0 0 0 0 0 0 0
2015-12-28 09:00:00 9 11 11 0 0 0 0 19 0 0
2015-12-28 10:00:00 10 85 13 0 0 2 0 15 0 0
2015-12-28 11:00:00 11 2 11 0 0 2 0 14 0 0
2015-12-28 12:00:00 12 2 20 0 4 0 0 10 0 0
2015-12-28 13:00:00 13 8 9 0 9 3 0 9 0 0
2015-12-28 14:00:00 14 4 10 0 8 0 0 22 0 0
2015-12-28 15:00:00 15 3 3 0 2 0 0 16 0 0
2015-12-28 16:00:00 16 14 5 1 1 0 0 19 0 0
2015-12-28 17:00:00 17 15 1 2 0 0 0 19 0 0
2015-12-28 18:00:00 18 0 0 0 6 0 0 0 0 0
2015-12-28 19:00:00 19 0 0 0 5 0 0 0 0 0
2015-12-28 20:00:00 20 0 0 0 1 0 0 0 0 0
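As an aside, the row-wise apply above can be replaced by a vectorized addition. A minimal sketch, assuming hr holds integer hours:
import pandas as pd

# build the full timestamp in one vectorized step instead of apply(axis=1)
df['yyyymmdd'] = pd.to_datetime(df['yyyymmdd']) + pd.to_timedelta(df['hr'], unit='h')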
You could plot the result as follows - assuming that you are looking for one subplot for each date and column:
import matplotlib.pyplot as plt

for d, data in df.groupby(pd.Grouper(freq='D')):  # pd.TimeGrouper was removed; pd.Grouper is the modern spelling
    data.plot.line(subplots=True, figsize=(10, 20), sharey=True)
    plt.gcf().savefig('cats {}.png'.format(d.date()), bbox_inches='tight')  # d.date() keeps ':' out of the filename
    plt.close()
to get one saved figure per date (the images from the original answer are not reproduced here).
I have the following panel dataset, where "winner" = 1 if someone is a winner in a given period (date), and 0 if a loser.
ID date winner
A 2017Q4 NaN
A 2018Q4 1
A 2019Q4 0
A 2020Q4 0
A 2021Q4 1
B 2017Q4 NaN
B 2018Q4 1
B 2019Q4 1
B 2020Q4 0
B 2021Q4 0
C 2017Q4 NaN
C 2018Q4 0
C 2019Q4 0
C 2020Q4 0
C 2021Q4 0
D 2017Q4 NaN
D 2018Q4 0
D 2019Q4 1
D 2020Q4 1
D 2021Q4 1
I want to create four dummy variables: WW=1 if someone is a winner in two consecutive periods; LL=1 if a loser in two consecutive periods; WL=1 if a winner in one period and a loser in the next; and LW vice versa.
UPDATE
When I apply the answers below, I get the following:
ID date winner WW LL WL LW
A 2017Q4 NaN
A 2018Q4 1 0 0 0 0
A 2019Q4 0 0 0 1 0
A 2020Q4 0 0 1 0 0
A 2021Q4 1 0 0 0 1
B 2017Q4 NaN
B 2018Q4 1 0 0 0 0
B 2019Q4 1 1 0 0 0
B 2020Q4 0 0 0 1 0
B 2021Q4 0 0 1 0 0
C 2017Q4 NaN
C 2018Q4 0 0 0 0 0
C 2019Q4 0 0 1 0 0
C 2020Q4 0 0 1 0 0
C 2021Q4 0 0 1 0 0
D 2017Q4 NaN
D 2018Q4 0 0 0 0 0
D 2019Q4 1 0 0 0 1
D 2020Q4 1 1 0 0 0
D 2021Q4 1 1 0 0 0
How do I make sure I get NaN when the previous value is NaN?
Desired output:
ID date winner WW LL WL LW
A 2017Q4 NaN
A 2018Q4 1 NaN NaN NaN NaN
A 2019Q4 0 0 0 1 0
A 2020Q4 0 0 1 0 0
A 2021Q4 1 0 0 0 1
B 2017Q4 NaN
B 2018Q4 1 NaN NaN NaN NaN
B 2019Q4 1 1 0 0 0
B 2020Q4 0 0 0 1 0
B 2021Q4 0 0 1 0 0
C 2017Q4 NaN
C 2018Q4 0 NaN NaN NaN NaN
C 2019Q4 0 0 1 0 0
C 2020Q4 0 0 1 0 0
C 2021Q4 0 0 1 0 0
D 2017Q4 NaN
D 2018Q4 0 NaN NaN NaN NaN
D 2019Q4 1 0 0 0 1
D 2020Q4 1 1 0 0 0
D 2021Q4 1 1 0 0 0
What is the simplest way to do this?
Here's one way: use groupby.shift to get the previous record, then use numpy.select to assign labels, which pd.get_dummies converts to dummy variables:
import numpy as np
import pandas as pd

df['previous'] = df.groupby('ID')['winner'].shift()
tmp = df[['previous', 'winner']]
dummy_vars = ['WW', 'LL', 'WL', 'LW']
out = (df.join(pd.get_dummies(np.select([tmp.eq(1).all(1),
                                         tmp.eq(0).all(1),
                                         tmp.eq([1, 0]).all(1),
                                         tmp.eq([0, 1]).all(1)],
                                        dummy_vars, ''))[dummy_vars + ['']]
                 .mask(df['previous'].isna(), ''))
         .drop(columns=['previous', '']))
Output:
ID date winner WW LL WL LW
0 A 2018Q4 1
1 A 2019Q4 0 0 0 1 0
2 A 2020Q4 0 0 1 0 0
3 A 2021Q4 1 0 0 0 1
4 B 2018Q4 1
5 B 2019Q4 1 1 0 0 0
6 B 2020Q4 0 0 0 1 0
7 B 2021Q4 0 0 1 0 0
8 C 2018Q4 0
9 C 2019Q4 0 0 1 0 0
10 C 2020Q4 0 0 1 0 0
11 C 2021Q4 0 0 1 0 0
12 D 2018Q4 0
13 D 2019Q4 1 0 0 0 1
14 D 2020Q4 1 1 0 0 0
15 D 2021Q4 1 1 0 0 0
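To see the labeling step in isolation, here is a tiny sketch with hypothetical prev/curr values:
import numpy as np
import pandas as pd

prev = pd.Series([1, 1, 0, 0])
curr = pd.Series([1, 0, 1, 0])
labels = np.select([prev.eq(1) & curr.eq(1),
                    prev.eq(0) & curr.eq(0),
                    prev.eq(1) & curr.eq(0),
                    prev.eq(0) & curr.eq(1)],
                   ['WW', 'LL', 'WL', 'LW'], default='')
print(labels)                  # ['WW' 'WL' 'LW' 'LL']
print(pd.get_dummies(labels))  # one 0/1 column per distinct label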
Another approach:
1. map 1 and 0 to "W" and "L"
2. get the 2-period streak
3. get_dummies for the "streak"
4. join to the original DataFrame, ignoring the first row of each ID
wins = df["winner"].fillna(0).map({1:"W",0:"L"})
streaks = wins.shift() + wins
other = pd.get_dummies(streaks.where(df["ID"].eq(df["ID"].shift())))
output = df.join(other.where(df["ID"].duplicated()&df["winner"].shift().notna()))
>>> output
ID date winner LL LW WL WW
0 A 2017Q4 NaN NaN NaN NaN NaN
1 A 2018Q4 1.0 NaN NaN NaN NaN
2 A 2019Q4 0.0 0.0 0.0 1.0 0.0
3 A 2020Q4 0.0 1.0 0.0 0.0 0.0
4 A 2021Q4 1.0 0.0 1.0 0.0 0.0
5 B 2017Q4 NaN NaN NaN NaN NaN
6 B 2018Q4 1.0 NaN NaN NaN NaN
7 B 2019Q4 1.0 0.0 0.0 0.0 1.0
8 B 2020Q4 0.0 0.0 0.0 1.0 0.0
9 B 2021Q4 0.0 1.0 0.0 0.0 0.0
10 C 2017Q4 NaN NaN NaN NaN NaN
11 C 2018Q4 0.0 NaN NaN NaN NaN
12 C 2019Q4 0.0 1.0 0.0 0.0 0.0
13 C 2020Q4 0.0 1.0 0.0 0.0 0.0
14 C 2021Q4 0.0 1.0 0.0 0.0 0.0
15 D 2017Q4 NaN NaN NaN NaN NaN
16 D 2018Q4 0.0 NaN NaN NaN NaN
17 D 2019Q4 1.0 0.0 1.0 0.0 0.0
18 D 2020Q4 1.0 0.0 0.0 0.0 1.0
19 D 2021Q4 1.0 0.0 0.0 0.0 1.0
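If the trailing .0 floats in the dummy columns bother you, they can be cast to pandas' nullable integer dtype, which keeps the NaNs; a sketch:
cols = ['LL', 'LW', 'WL', 'WW']
output[cols] = output[cols].astype('Int64')  # nullable 'Int64' (capital I) allows integer columns with NaN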
I am trying to find when a price value crosses above a previous high. I can find the high, but when I compare it to the current price it gives me all 1s.
My code:
peak = df['price'][(df['price'] > df['price'].shift(-1)) & (df['price'] > df['price'].shift(1))]
df['peak'] = peak
df['breakout'] = df['price'] > df['peak']
print(df)
out:
    price   peak  breakout
1       2    NaN         1
2       2    NaN         1
3       4    NaN         1
4       5    NaN         1
5       6    6.0         1
6       5    NaN         1
7       4    NaN         1
8       3    NaN         1
9      12   12.0         1
10     10    NaN         1
11     50    NaN         1
12    100    NaN         1
13    110  110.0         1
14     84    NaN         1
expect:
    price   peak  high  breakout
1       2    NaN     0         0
2       2    NaN     0         0
3       4    NaN     0         0
4       5    NaN     0         0
5       6    6.0     1         1
6       5    NaN     0         0
7       4    NaN     0         0
8       3    NaN     0         0
9      12   12.0     1         1
10     10    NaN     0         0
11     50    NaN     0         1
12    100    NaN     0         1
13    110  110.0     1         1
14     84    NaN     0         0
with fillna :
price peak look breakout
0 2 NaN NaN False
1 4 NaN NaN False
2 5 NaN NaN False
3 6 6.0 6.0 False
4 5 NaN 6.0 False
5 4 NaN 6.0 False
6 3 NaN 6.0 False
7 12 12.0 12.0 False ----> this should be True because it is higher than 6, and it is also the high relative to shift(-1) and shift(1)
8 10 NaN 12.0 False
9 50 NaN 12.0 True
10 100 100.0 100.0 False
11 40 NaN 100.0 False
12 45 45.0 45.0 False
13 30 NaN 45.0 False
14 200 NaN 45.0 True
Try with pandas.Series.ffill (the old fillna(method="ffill") spelling is deprecated in modern pandas):
df["breakout"] = df["price"] >= df["peak"].ffill()
If you want it as 1s and 0s, add the line:
df["breakout"] = df["breakout"].astype(int)
Note that df["peak"].ffill() returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 6.0
5 6.0
6 6.0
7 6.0
8 12.0
9 12.0
10 12.0
11 12.0
12 110.0
13 110.0
Name: peak, dtype: float64
So you can compare it easily with the price column.
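Putting it together, a minimal end-to-end sketch using the price data from the question:
import pandas as pd

df = pd.DataFrame({'price': [2, 2, 4, 5, 6, 5, 4, 3, 12, 10, 50, 100, 110, 84]})

# a local peak: strictly higher than both neighbours
is_peak = (df['price'] > df['price'].shift(-1)) & (df['price'] > df['price'].shift(1))
df['peak'] = df['price'].where(is_peak)

# carry the last peak forward and flag prices at or above it
df['breakout'] = (df['price'] >= df['peak'].ffill()).astype(int)
print(df)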
I have several different data frames that I need to drop certain rows from. Each data frame has the same sequence of rows, but located in different areas.
Summary Results Report Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3
0 DEM President NaN NaN NaN NaN
1 Vote For 1 NaN NaN NaN NaN
2 NaN NaN Ballots By NaN Election
3 TOTAL NaN NaN Early Voting NaN
4 NaN NaN Mail NaN Day
5 Tom Steyer NaN 0 0 0 0
6 Andrew Yang NaN 0 0 0 0
7 John K. Delaney NaN 0 0 0 0
8 Cory Booker NaN 0 0 0 0
9 Michael R. Bloomberg NaN 4 1 1 2
10 Julian Castro NaN 0 0 0 0
11 Elizabeth Warren NaN 1 0 1 0
12 Marianne Williamson NaN 0 0 0 0
13 Deval Patrick NaN 0 0 0 0
14 Robby Wells NaN 0 0 0 0
15 Amy Klobuchar NaN 3 1 2 0
16 Tulsi Gabbard NaN 0 0 0 0
17 Michael Bennet NaN 0 0 0 0
18 Bernie Sanders NaN 4 0 1 3
19 Pete Buttigieg NaN 0 0 0 0
20 Joseph R. Biden 21.0 0 3 18
21 Roque "Rocky" De La Fuente NaN 0 0 0 0
22 Total Votes Cast 33.0 2 8 23
Summary Results Report Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 DEM US Senator NaN NaN NaN NaN NaN NaN
1 Vote For 1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN Ballots By NaN Election NaN
3 TOTAL NaN NaN NaN Early Voting NaN NaN
4 NaN NaN NaN Mail NaN Day NaN
5 Jack Daniel Foster, Jr. 4.0 NaN 0 0 4 NaN
6 Mary (MJ) Hegar 4.0 NaN 1 3 0 NaN
7 Amanda K. Edwards 4.0 NaN 1 1 2 NaN
8 D. R. Hunter 1.0 NaN 0 0 1 NaN
9 Michael Cooper 3.0 NaN 0 0 3 NaN
10 Chris Bell 1.0 NaN 0 0 1 NaN
11 Royce West 3.0 NaN 0 0 3 NaN
12 Cristina Tzintzun Ramirez 5.0 NaN 0 3 2 NaN
13 Victor Hugo Harris 1.0 NaN 0 0 1 NaN
14 Sema Hernandez 1.0 NaN 0 0 1 NaN
15 Adrian Ocegueda 0.0 NaN 0 0 0 NaN
16 Annie "Mama" Garcia 3.0 NaN 0 1 2 NaN
17 Total Votes Cast 30 NaN NaN 2 8 20 NaN
18 DEM US Representative, Dist 1 NaN NaN NaN NaN NaN NaN
19 Vote For 1 NaN NaN NaN NaN NaN NaN
20 NaN NaN NaN Ballots By NaN Election NaN
21 TOTAL NaN NaN NaN Early Voting NaN NaN
22 NaN NaN NaN Mail NaN Day NaN
23 Hank Gilbert 26 NaN NaN 1 6 19 NaN
24 Total Votes Cast 26 NaN NaN 1 6 19 NaN
What I want to remove is the row that contains Vote For 1 in the first column, as well as the following 3 rows. The problem is that they can show up in multiple areas, and on occasion multiple times (as in the second data frame). What I have seems to work, in that it removes the required rows; however, at the end it gives me a key error, which tells me it is re-looping without any data.
for x in range(len(df)):
    if 'Vote For 1' in str(df.iloc[:, 0][x]):
        y = x + 3
        df = df.drop(df.loc[x:y].index)
        df.reset_index(drop=True, inplace=True)
        df.index.name = None
print(df)
the code produces the following output:
Summary Results Report Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 DEM US Senator NaN NaN NaN NaN NaN NaN
1 Jack Daniel Foster, Jr. 4.0 NaN 0 0 4 NaN
2 Mary (MJ) Hegar 4.0 NaN 1 3 0 NaN
3 Amanda K. Edwards 4.0 NaN 1 1 2 NaN
4 D. R. Hunter 1.0 NaN 0 0 1 NaN
5 Michael Cooper 3.0 NaN 0 0 3 NaN
6 Chris Bell 1.0 NaN 0 0 1 NaN
7 Royce West 3.0 NaN 0 0 3 NaN
8 Cristina Tzintzun Ramirez 5.0 NaN 0 3 2 NaN
9 Victor Hugo Harris 1.0 NaN 0 0 1 NaN
10 Sema Hernandez 1.0 NaN 0 0 1 NaN
11 Adrian Ocegueda 0.0 NaN 0 0 0 NaN
12 Annie "Mama" Garcia 3.0 NaN 0 1 2 NaN
13 Total Votes Cast 30 NaN NaN 2 8 20 NaN
14 DEM US Representative, Dist 1 NaN NaN NaN NaN NaN NaN
15 Hank Gilbert 26 NaN NaN 1 6 19 NaN
16 Total Votes Cast 26 NaN NaN 1 6 19 NaN
It errors out at the end with KeyError: 17. Any advice is greatly appreciated.
####EDIT####
I just wanted to give an update on the code that finally solved my problem. I know it is probably a little heavy-handed, but it does work.
remove_strings=['Vote For 1','TOTAL']
remove_strings_list = df.index[df['Summary Results Report'].isin(remove_strings)].tolist()
df = df.drop(df.index[remove_strings_list])
Not exactly sure what your column names are, but if the summary column contains the names, including the few you want to remove, this should work. Otherwise you may have to change the column name accordingly.
strings_to_remove = ['Vote for 1', 'Total', 'NaN']
df[~df.summary.isin(strings_to_remove)]
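If you also need to drop the three rows that follow each 'Vote For 1' row, as the original loop did, here is a loop-free sketch; it assumes the first column is named 'Summary Results Report':
import numpy as np

# integer positions of the 'Vote For 1' rows
hits = np.flatnonzero(df['Summary Results Report'].eq('Vote For 1'))

# each hit plus the three rows after it, clipped to the frame length
drop_pos = np.unique(np.concatenate([hits + k for k in range(4)]))
drop_pos = drop_pos[drop_pos < len(df)]

df = df.drop(df.index[drop_pos]).reset_index(drop=True)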
I have a dataframe like so:
import pandas as pd

df = pd.DataFrame({'time': ['23:59:45', '23:49:50', '23:59:55', '00:00:00', '00:00:05', '00:00:10', '00:00:15'],
                   'X': [-5, -4, -2, 5, 6, 10, 11],
                   'Y': [3, 4, 5, 9, 20, 22, 23]})
As you can see, the times are strings and run across midnight. A reading is given every 5 seconds!
My goal, however, is to add empty rows (filled with NaN, for example) so that there is a row for every second. Finally, the time column should be converted to a timestamp and set as the index.
Could you please suggest a smart and elegant way to achieve my goal?
Here is what the output should look like:
X Y
time
23:59:45 -5.0 3.0
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
... ... ...
00:00:10 10.0 22.0
00:00:11 NaN NaN
00:00:12 NaN NaN
00:00:13 NaN NaN
00:00:14 NaN NaN
00:00:15 11.0 23.0
Note: I do not need the dates.
Use to_timedelta with reindex by timedelta_range:
df['time'] = pd.to_timedelta(df['time'])
idx = pd.timedelta_range('0', '23:59:59', freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5.0 9.0
1 00:00:01 NaN NaN
2 00:00:02 NaN NaN
3 00:00:03 NaN NaN
4 00:00:04 NaN NaN
5 00:00:05 6.0 20.0
6 00:00:06 NaN NaN
7 00:00:07 NaN NaN
8 00:00:08 NaN NaN
9 00:00:09 NaN NaN
If need replace NaNs:
df = df.set_index('time').reindex(idx, fill_value=0).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5 9
1 00:00:01 0 0
2 00:00:02 0 0
3 00:00:03 0 0
4 00:00:04 0 0
5 00:00:05 6 20
6 00:00:06 0 0
7 00:00:07 0 0
8 00:00:08 0 0
9 00:00:09 0 0
Another solution with resample, but it is possible some rows are missing at the end:
df = df.set_index('time').resample('S').first()
print (df.tail(10))
X Y
time
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
23:59:49 NaN NaN
23:59:50 NaN NaN
23:59:51 NaN NaN
23:59:52 NaN NaN
23:59:53 NaN NaN
23:59:54 NaN NaN
23:59:55 -2.0 5.0
EDIT1:
import numpy as np

idx1 = pd.timedelta_range('23:59:45', '23:59:59', freq='S', name='time')
idx2 = pd.timedelta_range('0', '00:00:15', freq='S', name='time')
idx = np.concatenate([idx1, idx2])
df['time'] = pd.to_timedelta(df['time'])
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:59:45 -5.0 3.0
1 23:59:46 NaN NaN
2 23:59:47 NaN NaN
3 23:59:48 NaN NaN
4 23:59:49 NaN NaN
5 23:59:50 NaN NaN
6 23:59:51 NaN NaN
7 23:59:52 NaN NaN
8 23:59:53 NaN NaN
9 23:59:54 NaN NaN
print (df.tail(10))
time X Y
21 00:00:06 NaN NaN
22 00:00:07 NaN NaN
23 00:00:08 NaN NaN
24 00:00:09 NaN NaN
25 00:00:10 10.0 22.0
26 00:00:11 NaN NaN
27 00:00:12 NaN NaN
28 00:00:13 NaN NaN
29 00:00:14 NaN NaN
30 00:00:15 11.0 23.0
EDIT:
Another solution - shift the times that belong to the next day into 1-day timedeltas:
df['time'] = pd.to_timedelta(df['time'])
# count midnight crossings: a negative diff means the clock wrapped around
a = pd.to_timedelta(df['time'].diff().dt.days.abs().cumsum().fillna(1).sub(1), unit='d')
df['time'] = df['time'] + a
print (df)
print (df)
X Y time
0 -5 3 0 days 23:59:45
1 -4 4 0 days 23:49:50
2 -2 5 0 days 23:59:55
3 5 9 1 days 00:00:00
4 6 20 1 days 00:00:05
5 10 22 1 days 00:00:10
6 11 23 1 days 00:00:15
idx = pd.timedelta_range(df['time'].min(), df['time'].max(), freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:49:50 -4.0 4.0
1 23:49:51 NaN NaN
2 23:49:52 NaN NaN
3 23:49:53 NaN NaN
4 23:49:54 NaN NaN
5 23:49:55 NaN NaN
6 23:49:56 NaN NaN
7 23:49:57 NaN NaN
8 23:49:58 NaN NaN
9 23:49:59 NaN NaN
print (df.tail(10))
time X Y
616 1 days 00:00:06 NaN NaN
617 1 days 00:00:07 NaN NaN
618 1 days 00:00:08 NaN NaN
619 1 days 00:00:09 NaN NaN
620 1 days 00:00:10 10.0 22.0
621 1 days 00:00:11 NaN NaN
622 1 days 00:00:12 NaN NaN
623 1 days 00:00:13 NaN NaN
624 1 days 00:00:14 NaN NaN
625 1 days 00:00:15 11.0 23.0
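If you then want the index displayed as plain clock times (the question says the dates are not needed), one sketch is to anchor the timedeltas to an arbitrary date and keep only the time of day:
# '2000-01-01' is an arbitrary anchor date; .dt.time drops it again
df['time'] = (pd.Timestamp('2000-01-01') + df['time']).dt.time
df = df.set_index('time')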
I would like to calculate a sum (or another statistic) over sliding windows.
For example, I would like to calculate the sum of the last 10 data points from the current position, at positions where A is True.
Is there a way to do this?
With the code below it didn't return the values that I expect.
I put the expected values and the calculation on the side.
Thank you
In [63]: dt['As'] = dt.Val[dt.A == True].rolling(window=10, min_periods=1).sum()  # pd.rolling_sum was removed in modern pandas
In [64]: dt
Out[64]:
Val A B As
0 1 NaN NaN NaN
1 1 NaN NaN NaN
2 1 NaN NaN NaN
3 1 NaN NaN NaN
4 6 NaN True NaN
5 1 NaN NaN NaN
6 2 True NaN 1 pos 6 = 2
7 1 NaN NaN NaN
8 3 NaN NaN NaN
9 9 True NaN 2 pos 9 + pos 6 = 11
10 1 NaN NaN NaN
11 9 NaN NaN NaN
12 1 NaN NaN NaN
13 1 NaN True NaN
14 1 NaN NaN NaN
15 2 True NaN 3 pos 15 + pos 9 + pos 6 = 13
16 1 NaN NaN NaN
17 8 NaN NaN NaN
18 1 NaN NaN NaN
19 5 True NaN 4 pos 19 + pos 15 = 7
20 1 NaN NaN NaN
21 1 NaN NaN NaN
22 2 NaN NaN NaN
23 1 NaN NaN NaN
24 7 NaN True NaN
25 1 NaN NaN NaN
26 1 NaN NaN NaN
27 1 NaN NaN NaN
28 3 True NaN 5 pos 28 + pos 19 = 8
This almost does it:
import numpy as np
import pandas as pd

dt = pd.read_csv('test2.csv')
dt['AVal'] = dt.Val[dt.A == True]
dt['ASum'] = dt.AVal.rolling(window=10, min_periods=1).sum()   # was pd.rolling_sum
dt['ACnt'] = dt.AVal.rolling(window=10, min_periods=0).count() # was pd.rolling_count
In [4]: dt
Out[4]:
Val A B AVal ASum ACnt
0 1 NaN NaN NaN NaN 0
1 1 NaN NaN NaN NaN 0
2 1 NaN NaN NaN NaN 0
3 1 NaN NaN NaN NaN 0
4 6 NaN True NaN NaN 0
5 1 NaN NaN NaN NaN 0
6 2 True NaN 2 2 1
7 1 NaN NaN NaN 2 1
8 3 NaN NaN NaN 2 1
9 9 True NaN 9 11 2
10 1 NaN NaN NaN 11 2
11 9 NaN NaN NaN 11 2
12 1 NaN NaN NaN 11 2
13 1 NaN True NaN 11 2
14 1 NaN NaN NaN 11 2
15 2 True NaN 2 13 3
16 1 NaN NaN NaN 11 2
17 8 NaN NaN NaN 11 2
18 1 NaN NaN NaN 11 2
19 5 True NaN 5 7 2
20 1 NaN NaN NaN 7 2
21 1 NaN NaN NaN 7 2
22 2 NaN NaN NaN 7 2
23 1 NaN NaN NaN 7 2
24 7 NaN True NaN 7 2
25 1 NaN NaN NaN 5 1
26 1 NaN NaN NaN 5 1
27 1 NaN NaN NaN 5 1
28 3 True NaN 3 8 2
but I need NaN for all the values in ASum and ACnt where A is NaN.
Is this the way to do it?
Are you just doing a sum, or is this a simplified example for a more complex problem?
If it's just a sum then you can use a mix of fillna() and the fact that True and False act like 1 and 0 in np.sum:
In [8]: a = dt['A'].fillna(False).astype(bool)  # object column of True/NaN -> clean boolean
   ...: a.rolling(window=10, min_periods=1).sum()[a]
Out[8]:
6 1
9 2
15 3
19 2
28 2
dtype: float64
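If the goal really is the rolling sum of Val, with NaN wherever A is NaN, a sketch building on the almost-working attempt above:
a_val = dt['Val'].where(dt['A'] == True)  # Val only at flagged rows
dt['ASum'] = a_val.rolling(window=10, min_periods=1).sum()
dt['ACnt'] = a_val.rolling(window=10, min_periods=1).count()

# blank out both helper columns wherever A itself is NaN
dt[['ASum', 'ACnt']] = dt[['ASum', 'ACnt']].where(dt['A'].notna())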