Replace nan values with data from previous months - python

I have a DataFrame like the one below, and it contains NaN values. I want to replace each NaN with the most recent non-NaN value from the same day in a previous month:
date (y-d-m) | value
2022-01-01 | 1
2022-02-01 | 2
2022-03-01 | 3
2022-04-01 | 4
...
2022-01-02 | nan
2022-02-02 | nan
2022-03-02 | nan
2022-04-02 | nan
...
2022-01-03 | nan
2022-02-03 | nan
2022-03-03 | nan
2022-04-03 | nan
Desired outcome
date (y-d-m) | value
2022-01-01 | 1
2022-02-01 | 2
2022-03-01 | 3
2022-04-01 | 4
...
2022-01-02 | 1
2022-02-02 | 2
2022-03-02 | 3
2022-04-02 | 4
...
2022-01-03 | 1
2022-02-03 | 2
2022-03-03 | 3
2022-04-03 | 4
Data:
{'date (y-d-m)': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
'2022-01-02', '2022-02-02', '2022-03-02', '2022-04-02',
'2022-01-03', '2022-02-03', '2022-03-03', '2022-04-03'],
'value': [1.0, 2.0, 3.0, 4.0, nan, nan, nan, nan, nan, nan, nan, nan]}

You could convert the "date (y-d-m)" column to datetime, then group by the day of the month and forward fill with ffill, so that each NaN takes the value from the same day in the most recent earlier month:
df['date (y-d-m)'] = pd.to_datetime(df['date (y-d-m)'], format='%Y-%d-%m')
df['value'] = df.groupby(df['date (y-d-m)'].dt.day)['value'].ffill()
Output:
date (y-d-m) value
0 2022-01-01 1.0
1 2022-01-02 2.0
2 2022-01-03 3.0
3 2022-01-04 4.0
4 2022-02-01 1.0
5 2022-02-02 2.0
6 2022-02-03 3.0
7 2022-02-04 4.0
8 2022-03-01 1.0
9 2022-03-02 2.0
10 2022-03-03 3.0
11 2022-03-04 4.0
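For reference, here is a minimal self-contained sketch of the same approach, built from the data in the question. The stable sort before the fill is an added precaution (not part of the answer above) for frames whose rows are not already in chronological order, since ffill depends on row order within each group.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    'date (y-d-m)': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
                     '2022-01-02', '2022-02-02', '2022-03-02', '2022-04-02',
                     '2022-01-03', '2022-02-03', '2022-03-03', '2022-04-03'],
    'value': [1.0, 2.0, 3.0, 4.0, nan, nan, nan, nan, nan, nan, nan, nan],
})

# The strings are year-day-month, so the format must say so explicitly.
df['date (y-d-m)'] = pd.to_datetime(df['date (y-d-m)'], format='%Y-%d-%m')

# Assumption: sort chronologically first, so "previous" really means earlier.
df = df.sort_values('date (y-d-m)', kind='stable')

# Within each day of the month, carry the last known value forward.
df['value'] = df.groupby(df['date (y-d-m)'].dt.day)['value'].ffill()

# Restore the original row order.
df = df.sort_index()
print(df)
```

If a day has no value in any earlier month, its NaN is left in place, because ffill has nothing to carry forward.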

Related

Merging the multiple columns

I have a dataframe like this (actual data has 70 columns with timestamp) with Column name as A_Timestamp, BC_Timestamp, DA_Timestamp, CA_Timestamp, B_Values, C_values, D_Values, Q_Values
A_Timestamp | B_Values
2020-11-08 11:15:00 | 1
2020-11-10 15:34:00 | 2
BC_Timestamp | C_Values
2020-11-11 12:13:00 | 8
2020-11-15 02:47:00 | 4
DA_Timestamp | D_Values
2020-1-13 14:47:00 | 3
2020-11-9 5:34:00 | 5
CA_Timestamp | Q_Values
2020-7-18 01:04:00 | 7
2020-04-10 16:34:00 | 6
And I want a result like this:
| Timestamp | B_Values | C_Values | D_Values | Q_Values |
| 2020-11-08 11:15:00 | 1 | NaN | NaN | NaN |
| 2020-11-10 15:34:00 | 2 | NaN | NaN | NaN |
| 2020-11-11 12:13:00 | NaN | 8 | NaN | NaN |
| 2020-11-15 02:47:00 | NaN | 4 | NaN | NaN |
| 2020-1-13 14:47:00 | NaN | NaN | 3 | NaN |
| 2020-11-9 05:34:00 | NaN | NaN | 5 | NaN |
| 2020-7-18 01:04:00 | NaN | NaN | NaN | 7 |
I want to merge all the columns ending with 'Timestamp' into a single column, keeping each timestamp's value in its respective value column.
Assuming each timestamp/value pair is in its own DataFrame (df1 to df4), you can use a renamer for the Timestamp column and concatenate:
dfs = [df1, df2, df3, df4]
renamer = lambda x: 'Timestamp' if x.endswith('Timestamp') else x
out = pd.concat([d.rename(renamer, axis=1) for d in dfs])
Output:
Timestamp B_Values C_Values D_Values Q_Values
0 2020-11-08 11:15:00 1.0 NaN NaN NaN
1 2020-11-10 15:34:00 2.0 NaN NaN NaN
0 2020-11-11 12:13:00 NaN 8.0 NaN NaN
1 2020-11-15 02:47:00 NaN 4.0 NaN NaN
0 2020-1-13 14:47:00 NaN NaN 3.0 NaN
1 2020-11-9 5:34:00 NaN NaN 5.0 NaN
0 2020-7-18 01:04:00 NaN NaN NaN 7.0
1 2020-04-10 16:34:00 NaN NaN NaN 6.0
Alternatively, assuming you have a single DataFrame as input:
A_Timestamp B_Values BC_Timestamp C_Values DA_Timestamp D_Values CA_Timestamp Q_Values
0 2020-11-08 11:15:00 1 2020-11-11 12:13:00 8 2020-1-13 14:47:00 3 2020-7-18 01:04:00 7
1 2020-11-10 15:34:00 2 2020-11-15 02:47:00 4 2020-11-9 5:34:00 5 2020-04-10 16:34:00 6
You can then reshape with a MultiIndex:
m = df.columns.str.endswith('Timestamp')
s = df.columns.to_series().mask(m)
out = (df
       .set_axis(pd.MultiIndex.from_arrays(
           [s.bfill(), s.fillna('Timestamp')]), axis=1)
       .T.stack().unstack(-2).droplevel(0)
      )
Output:
B_Values C_Values D_Values Q_Values Timestamp
0 1 NaN NaN NaN 2020-11-08 11:15:00
1 2 NaN NaN NaN 2020-11-10 15:34:00
0 NaN 8 NaN NaN 2020-11-11 12:13:00
1 NaN 4 NaN NaN 2020-11-15 02:47:00
0 NaN NaN 3 NaN 2020-1-13 14:47:00
1 NaN NaN 5 NaN 2020-11-9 5:34:00
0 NaN NaN NaN 7 2020-7-18 01:04:00
1 NaN NaN NaN 6 2020-04-10 16:34:00
Or, if order of the rows doesn't matter:
m = df.columns.str.endswith('Timestamp')
s = df.columns.to_series().mask(m)
(df.set_axis(pd.MultiIndex.from_arrays(
       [s.fillna('Timestamp'), s.bfill()]), axis=1)
   .stack()
)
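If the MultiIndex reshape feels opaque, a plainer route is to slice the wide frame into two-column pieces and concatenate them. This is a sketch of a different technique from the reshape above, and it rests on one assumption that should be checked against the real data: every Timestamp column is immediately followed by its value column.

```python
import pandas as pd

df = pd.DataFrame({
    'A_Timestamp': ['2020-11-08 11:15:00', '2020-11-10 15:34:00'],
    'B_Values': [1, 2],
    'BC_Timestamp': ['2020-11-11 12:13:00', '2020-11-15 02:47:00'],
    'C_Values': [8, 4],
    'DA_Timestamp': ['2020-1-13 14:47:00', '2020-11-9 5:34:00'],
    'D_Values': [3, 5],
    'CA_Timestamp': ['2020-7-18 01:04:00', '2020-04-10 16:34:00'],
    'Q_Values': [7, 6],
})

pieces = []
# Assumption: columns come in (timestamp, value) pairs, in that order.
for i in range(0, df.shape[1], 2):
    pair = df.iloc[:, i:i + 2]
    # Give every timestamp column the same name so concat aligns them.
    pieces.append(pair.rename(columns={pair.columns[0]: 'Timestamp'}))

# ignore_index=True avoids the repeated 0/1 index labels in the result.
out = pd.concat(pieces, ignore_index=True)
print(out)
```

Value columns that do not belong to a given piece are filled with NaN by concat, which gives the requested layout.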

count groups of values with aggregated value

I have a dataset like this one:
DateTime | Value
2022-01-01 11:03:45 | 0
2022-01-01 11:03:50 | 40
2022-01-01 11:03:55 | 50
2022-01-01 11:04:00 | 60
2022-01-01 11:04:05 | 5
2022-01-01 11:04:10 | 4
2022-01-01 11:04:15 | 3
2022-01-01 11:04:20 | 0
2022-01-01 11:04:25 | 0
2022-01-01 11:04:30 | 40
2022-01-01 11:04:35 | 50
2022-01-01 11:04:40 | 4
2022-01-01 11:04:45 | 3
2022-01-01 11:04:50 | 0
2022-01-02 11:03:45 | 0
2022-01-02 11:03:50 | 5
2022-01-02 11:03:55 | 50
2022-01-02 11:04:00 | 60
2022-01-02 11:04:05 | 5
2022-01-02 11:04:10 | 4
2022-01-02 11:04:15 | 3
2022-01-02 11:04:20 | 0
2022-01-02 11:04:25 | 49
2022-01-02 11:04:30 | 40
2022-01-02 11:04:35 | 50
2022-01-02 11:04:40 | 4
2022-01-02 11:04:45 | 3
2022-01-02 11:04:50 | 0
As you can see, I have timestamps with values. They are measurements from a device that takes a sample every 5 seconds, and this is only a subset of the data. There are groups of low values and groups of high values. I define a value as high if it is greater than 10. If consecutive rows have high values, I consider them a group. What I would like to achieve:
count the number of groups per day
calculate the duration of each group
An example of my desired result is below:
DateTime | Value | GroupId | Duration (in seconds)
2022-01-01 11:03:45 | 0 | NaN | NaN
2022-01-01 11:03:50 | 40 | 1 | 15
2022-01-01 11:03:55 | 50 | 1 | 15
2022-01-01 11:04:00 | 60 | 1 | 15
2022-01-01 11:04:05 | 5 | NaN | NaN
2022-01-01 11:04:10 | 4 | NaN | NaN
2022-01-01 11:04:15 | 3 | NaN | NaN
2022-01-01 11:04:20 | 0 | NaN | NaN
2022-01-01 11:04:25 | 0 | NaN | NaN
2022-01-01 11:04:30 | 40 | 2 | 10
2022-01-01 11:04:35 | 50 | 2 | 10
2022-01-01 11:04:40 | 4 | NaN | NaN
2022-01-01 11:04:45 | 3 | NaN | NaN
2022-01-01 11:04:50 | 0 | NaN | NaN
2022-01-02 11:03:45 | 0 | NaN | NaN
2022-01-02 11:03:50 | 5 | NaN | NaN
2022-01-02 11:03:55 | 50 | 1 | 10
2022-01-02 11:04:00 | 60 | 1 | 10
2022-01-02 11:04:05 | 5 | NaN | NaN
2022-01-02 11:04:10 | 4 | NaN | NaN
2022-01-02 11:04:15 | 3 | NaN | NaN
2022-01-02 11:04:20 | 0 | NaN | NaN
2022-01-02 11:04:25 | 49 | 2 | 15
2022-01-02 11:04:30 | 40 | 2 | 15
2022-01-02 11:04:35 | 50 | 2 | 15
2022-01-02 11:04:40 | 4 | NaN | NaN
2022-01-02 11:04:45 | 3 | NaN | NaN
2022-01-02 11:04:50 | 0 | NaN | NaN
I know how to read data in pandas and do basic manipulation. Can you give me any hints on how to find those groups, measure their duration, and assign a number to them? Thanks!
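For GroupId, create groups from consecutive values greater than 10: compare the boolean mask with its shifted version to mark where each run starts, keep only the high rows, and number the runs per date with GroupBy.cumsum. Then, per date and GroupId, take the maximal and minimal datetime and subtract them. Finally, add 5 seconds, because a sample is taken every 5 seconds: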
df['DateTime'] = pd.to_datetime(df['DateTime'])
s = df['Value'].gt(10)
date = df['DateTime'].dt.date
df['GroupId'] = s.ne(s.shift())[s].groupby(date).cumsum()
g = df.groupby([date,'GroupId'])['DateTime']
df['Duration (in seconds)'] = (g.transform('max').sub(g.transform('min'))
.dt.total_seconds().add(5))
print (df)
DateTime Value GroupId Duration (in seconds)
0 2022-01-01 11:03:45 0 NaN NaN
1 2022-01-01 11:03:50 40 1.0 15.0
2 2022-01-01 11:03:55 50 1.0 15.0
3 2022-01-01 11:04:00 60 1.0 15.0
4 2022-01-01 11:04:05 5 NaN NaN
5 2022-01-01 11:04:10 4 NaN NaN
6 2022-01-01 11:04:15 3 NaN NaN
7 2022-01-01 11:04:20 0 NaN NaN
8 2022-01-01 11:04:25 0 NaN NaN
9 2022-01-01 11:04:30 40 2.0 10.0
10 2022-01-01 11:04:35 50 2.0 10.0
11 2022-01-01 11:04:40 4 NaN NaN
12 2022-01-01 11:04:45 3 NaN NaN
13 2022-01-01 11:04:50 0 NaN NaN
14 2022-01-02 11:03:45 0 NaN NaN
15 2022-01-02 11:03:50 5 NaN NaN
16 2022-01-02 11:03:55 50 1.0 10.0
17 2022-01-02 11:04:00 60 1.0 10.0
18 2022-01-02 11:04:05 5 NaN NaN
19 2022-01-02 11:04:10 4 NaN NaN
20 2022-01-02 11:04:15 3 NaN NaN
21 2022-01-02 11:04:20 0 NaN NaN
22 2022-01-02 11:04:25 49 2.0 15.0
23 2022-01-02 11:04:30 40 2.0 15.0
24 2022-01-02 11:04:35 50 2.0 15.0
25 2022-01-02 11:04:40 4 NaN NaN
26 2022-01-02 11:04:45 3 NaN NaN
27 2022-01-02 11:04:50 0 NaN NaN
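The question also asks for the number of groups per day, which a per-group summary table gives directly. The sketch below is one possible way to build it, not the answer's exact code: it uses a small made-up subset of readings, and the names `Date`, `high`, `starts`, `summary` and `groups_per_day` are illustrative. The 5 second sampling interval is taken from the question.

```python
import pandas as pd

# Hypothetical subset of readings, one sample every 5 seconds.
df = pd.DataFrame({
    'DateTime': pd.to_datetime([
        '2022-01-01 11:03:45', '2022-01-01 11:03:50', '2022-01-01 11:03:55',
        '2022-01-01 11:04:00', '2022-01-01 11:04:05', '2022-01-01 11:04:10',
        '2022-01-01 11:04:15', '2022-01-02 11:03:45', '2022-01-02 11:03:50',
        '2022-01-02 11:03:55',
    ]),
    'Value': [0, 40, 50, 60, 5, 40, 50, 0, 49, 3],
})
df['Date'] = df['DateTime'].dt.date

high = df['Value'].gt(10)
high_int = high.astype(int)
# Previous row's flag within the same day (0 for the first row of a day).
prev = high_int.groupby(df['Date']).shift(fill_value=0)
# A run starts where the current row is high and the previous one is not.
starts = (high_int.eq(1) & prev.eq(0)).astype(int)
# Number the runs per day; rows outside a run get NaN.
df['GroupId'] = starts.groupby(df['Date']).cumsum().where(high)

# One row per (day, group) with the first and last timestamp of the run.
summary = df[high].groupby(['Date', 'GroupId'])['DateTime'].agg(['min', 'max'])
# Add one sampling interval (5 s) so a single-sample run lasts 5 seconds.
summary['Duration'] = (summary['max'] - summary['min']).dt.total_seconds() + 5

groups_per_day = summary.groupby(level='Date').size()
print(summary)
print(groups_per_day)
```

Shifting within each day rather than over the whole column means a run that is already high on the first sample of a day still gets its own group number.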
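Another idea is to compute Duration from the row just before each group: back fill GroupId by one row per date, so that each group also includes the preceding sample and no extra 5 seconds have to be added: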
df['DateTime'] = pd.to_datetime(df['DateTime'])
s = df['Value'].gt(10)
date = df['DateTime'].dt.date
df['GroupId'] = s.ne(s.shift())[s].groupby(date).cumsum()
prev = df.groupby(date)['GroupId'].bfill(limit=1)
g = df.groupby([date,prev])['DateTime']
df['Duration (in seconds)'] = (g.transform('max').sub(g.transform('min'))
.dt.total_seconds()
.where(s))
print (df)
DateTime Value GroupId Duration (in seconds)
0 2022-01-01 11:03:45 0 NaN NaN
1 2022-01-01 11:03:50 40 1.0 15.0
2 2022-01-01 11:03:55 50 1.0 15.0
3 2022-01-01 11:04:00 60 1.0 15.0
4 2022-01-01 11:04:05 5 NaN NaN
5 2022-01-01 11:04:10 4 NaN NaN
6 2022-01-01 11:04:15 3 NaN NaN
7 2022-01-01 11:04:20 0 NaN NaN
8 2022-01-01 11:04:25 0 NaN NaN
9 2022-01-01 11:04:30 40 2.0 10.0
10 2022-01-01 11:04:35 50 2.0 10.0
11 2022-01-01 11:04:40 4 NaN NaN
12 2022-01-01 11:04:45 3 NaN NaN
13 2022-01-01 11:04:50 0 NaN NaN
14 2022-01-02 11:03:45 0 NaN NaN
15 2022-01-02 11:03:50 5 NaN NaN
16 2022-01-02 11:03:55 50 1.0 10.0
17 2022-01-02 11:04:00 60 1.0 10.0
18 2022-01-02 11:04:05 5 NaN NaN
19 2022-01-02 11:04:10 4 NaN NaN
20 2022-01-02 11:04:15 3 NaN NaN
21 2022-01-02 11:04:20 0 NaN NaN
22 2022-01-02 11:04:25 49 2.0 15.0
23 2022-01-02 11:04:30 40 2.0 15.0
24 2022-01-02 11:04:35 50 2.0 15.0
25 2022-01-02 11:04:40 4 NaN NaN
26 2022-01-02 11:04:45 3 NaN NaN
27 2022-01-02 11:04:50 0 NaN NaN

Pandas row-wise aggregation with multi-index

I have a pandas DataFrame with three levels of row indexing. The last level is a datetime index. There are NaN values, and I am trying to fill them with the average of each row at the datetime level. How can I go about doing this?
data_df
Level 0 | Level 1 | Level 2 |
A 123 2019-01-28 17:00:00 | 3 | 1 | nan
2019-01-28 18:00:00 | 2 | nan | 1
2019-01-28 19:00:00 | nan | nan | 5
234 2019-01-28 05:00:00 | 1 | 1 | 3
2019-01-28 06:00:00 | nan | nan | nan
Some rows may contain only NaN values. In this case I want to fill the row with 0s. Some rows may have all values filled in, so imputing with the average isn't needed.
I want the following result:
Level 0 | Level 1 | Level 2 |
A 123 2019-01-28 17:00:00 | 3 | 1 | 2
2019-01-28 18:00:00 | 2 | 1.5 | 1
2019-01-28 19:00:00 | 5 | 5 | 5
234 2019-01-28 05:00:00 | 1 | 1 | 3
2019-01-28 06:00:00 | 0 | 0 | 0
Use DataFrame.mask to replace the NaNs with the mean of each row, then use DataFrame.fillna to set the rows that contain only NaNs to 0:
df = df.mask(df.isna(), df.mean(axis=1), axis=0).fillna(0)
print (df)
a b c
Level 0 Level 1 Level 2
A 123 2019-01-28 17:00:00 3.0 1.0 2.0
2019-01-28 18:00:00 2.0 1.5 1.0
2019-01-28 19:00:00 5.0 5.0 5.0
234 2019-01-28 05:00:00 1.0 1.0 3.0
2019-01-28 06:00:00 0.0 0.0 0.0
Another solution is to use DataFrame.fillna for the replacement, but because df.fillna(df.mean(axis=1), axis=1) is not implemented, a double transpose is necessary:
df = df.T.fillna(df.mean(axis=1)).fillna(0).T
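A third variant, sketched here as an alternative to the two solutions above, fills each column separately with the row means, relying on index alignment of Series.fillna. The column names `a`, `b`, `c` follow the printed output above; the frame itself is rebuilt from the question's sample.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('A', 123, pd.Timestamp('2019-01-28 17:00:00')),
     ('A', 123, pd.Timestamp('2019-01-28 18:00:00')),
     ('A', 123, pd.Timestamp('2019-01-28 19:00:00')),
     ('A', 234, pd.Timestamp('2019-01-28 05:00:00')),
     ('A', 234, pd.Timestamp('2019-01-28 06:00:00'))],
    names=['Level 0', 'Level 1', 'Level 2'])
df = pd.DataFrame({'a': [3, 2, np.nan, 1, np.nan],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [np.nan, 1, 5, 3, np.nan]}, index=idx)

# Mean of the non-NaN values in each row (NaN for all-NaN rows).
row_mean = df.mean(axis=1)

# Fill every column from the row means (aligned on the MultiIndex),
# then replace what is still NaN (rows that were all NaN) with 0.
filled = df.apply(lambda col: col.fillna(row_mean)).fillna(0)
print(filled)
```

Rows that were already complete pass through unchanged, since fillna only touches missing cells.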

Retrieve time difference since last action -- python/pandas

Let's say I have purchase records with two fields Buy and Time.
What I want is a third column holding the time elapsed since the first not-buy after the previous buy, so it looks like:
buy| time | time difference
1 | 8:00 | NULL
0 | 9:01 | NULL
0 | 9:10 | NULL
0 | 9:21 | NULL
1 | 9:31 | 0:30
0 | 9:41 | NULL
0 | 9:42 | NULL
1 | 9:53 | 0:12
How can I achieve this? It seems to me that it's a mix of pd.groupby() and pd.shift(), but I can't seem to work that out in my head.
IIUC
df.time=pd.to_datetime(df.time)
df.loc[df.buy==1,'DIFF']=df.groupby(df.buy.cumsum().shift().fillna(0)).time.transform(lambda x : x.iloc[-1]-x.iloc[0])
df
Out[19]:
buy time timedifference DIFF
0 1 2018-02-26 08:00:00 NaN 00:00:00
1 0 2018-02-26 09:01:00 NaN NaT
2 0 2018-02-26 09:10:00 NaN NaT
3 0 2018-02-26 09:21:00 NaN NaT
4 1 2018-02-26 09:31:00 0:30 00:30:00
5 0 2018-02-26 09:41:00 NaN NaT
6 0 2018-02-26 09:42:00 NaN NaT
7 1 2018-02-26 09:53:00 0:12 00:12:00
#df.buy.cumsum().shift().fillna(0) creates the key for the groupby
#time.transform(lambda x : x.iloc[-1]-x.iloc[0]) computes the difference between the last and first time in each group
#df.loc[df.buy==1,'DIFF'] assigns the result only to the rows where buy equals 1
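The same grouping idea can be written without a lambda. The sketch below is a variation, not the answer's exact code: it rebuilds the question's sample with zero-padded time strings, and it additionally blanks out a buy that has no not-buy row before it (such as the first row), which matches the NULL in the desired output. The names `key`, `first_time` and `elapsed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'buy': [1, 0, 0, 0, 1, 0, 0, 1],
    'time': ['08:00', '09:01', '09:10', '09:21',
             '09:31', '09:41', '09:42', '09:53'],
})
df['time'] = pd.to_datetime(df['time'], format='%H:%M')

# Rows after a buy, up to and including the next buy, share one key.
key = df['buy'].cumsum().shift(fill_value=0)

# First timestamp in each group, i.e. the first not-buy after the last buy.
first_time = df.groupby(key)['time'].transform('first')

# Keep the elapsed time only on buy rows that had a not-buy row before them.
is_buy = df['buy'].eq(1)
elapsed = (df['time'] - first_time).where(is_buy & first_time.ne(df['time']))
df['time difference'] = elapsed
print(df)
```

With this masking, two buys in a row would also leave the second one empty, since no not-buy happened in between.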

Add missing times in dataframe column with pandas

I have a dataframe like so:
df = pd.DataFrame({'time':['23:59:45','23:49:50','23:59:55','00:00:00','00:00:05','00:00:10','00:00:15'],
'X':[-5,-4,-2,5,6,10,11],
'Y':[3,4,5,9,20,22,23]})
As you can see, the time column holds times of day as strings, and the values run across midnight. A sample is given every 5 seconds!
My goal, however, is to add empty rows (filled with NaN, for example) so that there is one row per second. Finally, the time column should be converted to a timestamp and set as the index.
Could you please suggest a smart and elegant way to achieve my goal?
Here is what the output should look like:
X Y
time
23:59:45 -5.0 3.0
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
... ... ...
00:00:10 10.0 22.0
00:00:11 NaN NaN
00:00:12 NaN NaN
00:00:13 NaN NaN
00:00:14 NaN NaN
00:00:15 11.0 23.0
Note: I do not need the dates.
Use to_timedelta with reindex by timedelta_range:
df['time'] = pd.to_timedelta(df['time'])
idx = pd.timedelta_range('0', '23:59:59', freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5.0 9.0
1 00:00:01 NaN NaN
2 00:00:02 NaN NaN
3 00:00:03 NaN NaN
4 00:00:04 NaN NaN
5 00:00:05 6.0 20.0
6 00:00:06 NaN NaN
7 00:00:07 NaN NaN
8 00:00:08 NaN NaN
9 00:00:09 NaN NaN
If you need to replace the NaNs:
df = df.set_index('time').reindex(idx, fill_value=0).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5 9
1 00:00:01 0 0
2 00:00:02 0 0
3 00:00:03 0 0
4 00:00:04 0 0
5 00:00:05 6 20
6 00:00:06 0 0
7 00:00:07 0 0
8 00:00:08 0 0
9 00:00:09 0 0
Another solution uses resample, but it only spans from the first to the last existing time, so some rows may be missing at the end:
df = df.set_index('time').resample('S').first()
print (df.tail(10))
X Y
time
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
23:59:49 NaN NaN
23:59:50 NaN NaN
23:59:51 NaN NaN
23:59:52 NaN NaN
23:59:53 NaN NaN
23:59:54 NaN NaN
23:59:55 -2.0 5.0
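EDIT1: To cover only the span around midnight, build one range before midnight and one after it, then join them (np is NumPy, imported with import numpy as np):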
idx1 = pd.timedelta_range('23:59:45', '23:59:59', freq='S', name='time')
idx2 = pd.timedelta_range('0', '00:00:15', freq='S', name='time')
idx = np.concatenate([idx1, idx2])
df['time'] = pd.to_timedelta(df['time'])
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:59:45 -5.0 3.0
1 23:59:46 NaN NaN
2 23:59:47 NaN NaN
3 23:59:48 NaN NaN
4 23:59:49 NaN NaN
5 23:59:50 NaN NaN
6 23:59:51 NaN NaN
7 23:59:52 NaN NaN
8 23:59:53 NaN NaN
9 23:59:54 NaN NaN
print (df.tail(10))
time X Y
21 00:00:06 NaN NaN
22 00:00:07 NaN NaN
23 00:00:08 NaN NaN
24 00:00:09 NaN NaN
25 00:00:10 10.0 22.0
26 00:00:11 NaN NaN
27 00:00:12 NaN NaN
28 00:00:13 NaN NaN
29 00:00:14 NaN NaN
30 00:00:15 11.0 23.0
EDIT:
Another solution is to add 1 day to the timedeltas that fall on the next day:
df['time'] = pd.to_timedelta(df['time'])
a = pd.to_timedelta(df['time'].diff().dt.days.abs().cumsum().fillna(1).sub(1), unit='d')
df['time'] = df['time'] + a
print (df)
X Y time
0 -5 3 0 days 23:59:45
1 -4 4 0 days 23:49:50
2 -2 5 0 days 23:59:55
3 5 9 1 days 00:00:00
4 6 20 1 days 00:00:05
5 10 22 1 days 00:00:10
6 11 23 1 days 00:00:15
idx = pd.timedelta_range(df['time'].min(), df['time'].max(), freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:49:50 -4.0 4.0
1 23:49:51 NaN NaN
2 23:49:52 NaN NaN
3 23:49:53 NaN NaN
4 23:49:54 NaN NaN
5 23:49:55 NaN NaN
6 23:49:56 NaN NaN
7 23:49:57 NaN NaN
8 23:49:58 NaN NaN
9 23:49:59 NaN NaN
print (df.tail(10))
time X Y
616 1 days 00:00:06 NaN NaN
617 1 days 00:00:07 NaN NaN
618 1 days 00:00:08 NaN NaN
619 1 days 00:00:09 NaN NaN
620 1 days 00:00:10 10.0 22.0
621 1 days 00:00:11 NaN NaN
622 1 days 00:00:12 NaN NaN
623 1 days 00:00:13 NaN NaN
624 1 days 00:00:14 NaN NaN
625 1 days 00:00:15 11.0 23.0
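The rollover idea can be condensed into a short self-contained sketch. Two things here are assumptions rather than part of the answer: the second time is taken to be '23:59:50' (the question says samples arrive every 5 seconds, so '23:49:50' looks like a typo), and a rollover is detected as any backwards jump in time. The names `td`, `rollovers`, `full` and `out` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    # Assumption: '23:59:50' instead of the question's '23:49:50'.
    'time': ['23:59:45', '23:59:50', '23:59:55', '00:00:00',
             '00:00:05', '00:00:10', '00:00:15'],
    'X': [-5, -4, -2, 5, 6, 10, 11],
    'Y': [3, 4, 5, 9, 20, 22, 23],
})

td = pd.to_timedelta(df['time'])

# Each backwards jump marks a pass over midnight; count them cumulatively.
rollovers = td.diff().lt(pd.Timedelta(0)).cumsum()

# Shift times after midnight by one day so the series keeps increasing.
df['time'] = td + pd.to_timedelta(rollovers, unit='D')

# One row per second between the first and last sample.
full = pd.timedelta_range(df['time'].min(), df['time'].max(),
                          freq=pd.Timedelta(seconds=1), name='time')
out = df.set_index('time').reindex(full)
print(out)
```

Passing the frequency as a Timedelta sidesteps the spelling of the seconds alias, which differs between pandas versions.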
