I have a data frame (named df) covering 2016/1/1 00:00 until 2018/11/25 23:00, with an hourly timestamp, an object_id and a value. The data set only contains rows where an object_id has a value.
timestampHour object_id value
2016/1/1 00:00 1 2
2016/1/1 00:00 3 1
2016/1/1 01:00 1 1
2016/1/1 01:00 2 3
2016/1/1 02:00 2 3
2016/1/1 02:00 3 2
I would like to get a dataframe showing all object_ids for every hour, with a null value where there is no value.
timestampHour object_id value
2016/1/1 00:00 1 2
2016/1/1 00:00 2 null
2016/1/1 00:00 3 1
2016/1/1 01:00 1 1
2016/1/1 01:00 2 3
2016/1/1 01:00 3 null
2016/1/1 02:00 1 null
2016/1/1 02:00 2 3
2016/1/1 02:00 3 2
I have created datetimes from the timestamps and rounded them to hours with the following code:
df['timestamp'] = pd.to_datetime(df['result_timestamp'])
df['timestampHour'] = df['timestamp'].dt.round('60min')
(I don't know if there are better options, but I have been trying to create 12 timestampHour rows per hour (I have 12 unique object_ids) and fill the newly created rows with the object_ids that are unused for that hour. However, I have not been able to create the empty rows with that condition.)
I am fairly new to programming, and searching other posts has not given me a clue on how to get closer to solving this problem.
Using pivot_table and unstack:
df.pivot_table(
    index='object_id', columns='timestampHour', values='value'
).unstack().rename('value').reset_index()
timestampHour object_id value
0 2016/1/1 00:00 1 2.0
1 2016/1/1 00:00 2 NaN
2 2016/1/1 00:00 3 1.0
3 2016/1/1 01:00 1 1.0
4 2016/1/1 01:00 2 3.0
5 2016/1/1 01:00 3 NaN
6 2016/1/1 02:00 1 NaN
7 2016/1/1 02:00 2 3.0
8 2016/1/1 02:00 3 2.0
To see why this works, it helps to look at the intermediate pivot_table:
timestampHour 2016/1/1 00:00 2016/1/1 01:00 2016/1/1 02:00
object_id
1 2.0 1.0 NaN
2 NaN 3.0 3.0
3 1.0 NaN 2.0
Where no value is found for a combination of object_id and timestampHour, a NaN is added to the table. When you unstack, these NaNs are kept, giving you the desired result with missing values represented.
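For reference, a minimal, self-contained sketch that rebuilds the sample data from the question and prints both the intermediate pivot and the final long-format result:
import pandas as pd

df = pd.DataFrame({
    'timestampHour': pd.to_datetime(['2016-01-01 00:00', '2016-01-01 00:00',
                                     '2016-01-01 01:00', '2016-01-01 01:00',
                                     '2016-01-01 02:00', '2016-01-01 02:00']),
    'object_id': [1, 3, 1, 2, 2, 3],
    'value': [2, 1, 1, 3, 3, 2],
})

# wide intermediate table: one row per object_id, one column per hour, NaN where missing
pivot = df.pivot_table(index='object_id', columns='timestampHour', values='value')
print(pivot)

# back to long format; the NaNs for missing combinations are kept
result = pivot.unstack().rename('value').reset_index()
print(result)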
This can also be done with .reindex using a Cartesian product of the two levels. This question goes into detail on ways to optimize the performance of the product for large datasets.
import pandas as pd

id_cols = ['timestampHour', 'object_id']

# Cartesian product of the unique hours and the unique object_ids
idx = pd.MultiIndex.from_product([df[c].unique() for c in id_cols], names=id_cols)

df.set_index(id_cols).reindex(idx).reset_index()
Output:
timestampHour object_id value
0 2016/1/1 00:00 1 2.0
1 2016/1/1 00:00 3 1.0
2 2016/1/1 00:00 2 NaN
3 2016/1/1 01:00 1 1.0
4 2016/1/1 01:00 3 NaN
5 2016/1/1 01:00 2 3.0
6 2016/1/1 02:00 1 NaN
7 2016/1/1 02:00 3 2.0
8 2016/1/1 02:00 2 3.0
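Note that reindex keeps the unique values in their order of appearance (hence object_id 1, 3, 2 above). If you prefer sorted rows, continuing from the snippet above, one option is to sort at the end:
out = df.set_index(id_cols).reindex(idx).reset_index()
out = out.sort_values(id_cols, ignore_index=True)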
I have data with date, time, and values, and I want to calculate a forward-looking rolling maximum for each date:
Date Time Value Output
01/01/2022 01:00 1.3 1.4
01/01/2022 02:00 1.4 1.2
01/01/2022 03:00 0.9 1.2
01/01/2022 04:00 1.2 NaN
01/02/2022 01:00 5 4
01/02/2022 02:00 4 3
01/02/2022 03:00 2 3
01/02/2022 04:00 3 NaN
I have tried this:
df = df.sort_values(by=['Date','Time'], ascending=True)
df['rollingmax'] = df.groupby(['Date'])['Value'].rolling(window=4,min_periods=0).max()
df = df.sort_values(by=['Date','Time'], ascending=False)
but that doesn't seem to work...
It looks like you want a shifted reverse rolling max:
n = 4
df['Output'] = (
    df[::-1]                                     # reverse so the rolling window looks forward in time
    .groupby('Date', group_keys=False)['Value']  # group_keys=False keeps the original row index
    .apply(lambda g: g.rolling(n-1, min_periods=1).max().shift())  # max of the next n-1 rows, shifted past the current row
)
Output:
Date Time Value Output
0 01/01/2022 01:00 1.3 1.4
1 01/01/2022 02:00 1.4 1.2
2 01/01/2022 03:00 0.9 1.2
3 01/01/2022 04:00 1.2 NaN
4 01/02/2022 01:00 5.0 4.0
5 01/02/2022 02:00 4.0 3.0
6 01/02/2022 03:00 2.0 3.0
7 01/02/2022 04:00 3.0 NaN
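If the one-liner is hard to follow, here is a more explicit sketch of the same idea (reverse each date's rows, take a rolling max, shift, restore the order); the Output_explicit column is only added here for comparison:
import pandas as pd

n = 4
pieces = []
for _, g in df.groupby('Date', sort=False):
    rev = g['Value'][::-1]                             # reverse so "forward in time" becomes "backward"
    fwd_max = rev.rolling(n - 1, min_periods=1).max()  # rolling max over n-1 consecutive hours
    pieces.append(fwd_max.shift()[::-1])               # shift so each row sees only later hours, then restore order
df['Output_explicit'] = pd.concat(pieces)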
I have a dataframe, X_train, with three variables, a datetime index that takes a reading every 5 minutes, and an ID column:
X_train
Time ID var_1 var_2 var_3
2020-01-01 00:00:00 1 9.3 4.2 2.4
2020-01-02 00:00:05 1 3.5 4.5 7.6
2020-01-01 00:00:00 2 2.1 7.6 4.5
2020-01-02 00:00:05 2 3.9 7.5 7.0
and a second dataframe, y_train, with labels for the mode each ID is in:
y_train
Time ID mode label
2020-01-01 00:00:00 1 1 B
2020-01-02 00:00:05 1 1 B
2020-01-01 00:00:00 2 0 A
2020-01-02 00:00:05 2 0 A
I want to slice the data by ID and time with a step size of 1 day, i.e. 288 rows, since this data is time-series dependent. So far I've managed to split the data by ID using groupby; however, I'm not sure how to apply the time slicing.
Here's what I've tried:
FEATURE_COLUMNS = X_train.columns.to_list()

sequences = []
for Id, group in X_train.groupby("ID"):
    sequence_features = group[FEATURE_COLUMNS]
    label = y_train[y_train.ID == Id].iloc[0].label
    sequences.append((sequence_features, label))
This gives me one slice per ID, but without the time slicing:
( ID var_1 var_2 var_3
Time
2016-01-09 01:55:00 2 0.402679 0.588398 0.560771
2016-03-22 11:40:00 2 0.382457 0.507188 0.450901
2016-02-29 09:40:00 2 0.344540 0.652963 0.607460
2016-01-06 01:00:00 2 0.384479 0.825977 0.499619
2016-01-19 18:10:00 2 0.437563 0.631526 0.479827
... ... ... ... ...
2016-01-10 23:30:00 2 0.366026 0.829760 0.636387
2016-01-22 18:25:00 2 0.976997 0.350567 0.674448
2016-01-28 06:30:00 2 0.975986 0.719546 0.727988
2016-02-27 04:15:00 2 0.451972 0.674149 0.470185
2016-03-10 19:15:00 2 0.354146 0.423203 0.487947
[17673 rows x 4 columns],
'b')
I feel I need to add a line that tells the loop to only look at 288 rows per ID at a time, but I'm not sure how to do it.
Edit: my sliced output also rearranges the datetime index into a strange order; is there a way to fix this?
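Not a complete answer, but a minimal sketch of one way to do both things: sort each ID's rows by time, then cut them into consecutive 288-row windows (one day of 5-minute readings). It assumes X_train and y_train as shown above, with Time as the index of X_train, and restricts FEATURE_COLUMNS to the three feature columns (adjust as needed):
WINDOW = 288  # 24 hours of 5-minute readings
FEATURE_COLUMNS = ["var_1", "var_2", "var_3"]

sequences = []
for Id, group in X_train.groupby("ID"):
    group = group.sort_index()  # restore chronological order within this ID
    label = y_train.loc[y_train.ID == Id, "label"].iloc[0]
    # consecutive, non-overlapping one-day chunks for this ID
    for start in range(0, len(group), WINDOW):
        chunk = group.iloc[start:start + WINDOW][FEATURE_COLUMNS]
        if len(chunk) == WINDOW:  # skip a trailing partial day, if any
            sequences.append((chunk, label))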
I have a table with the following structure; the count column gets updated every time a user accesses the app again on that date.
user_id  date      count
1        1/1/2021  4
2        1/1/2021  7
1        1/2/2021  3
3        1/2/2021  10
2        1/3/2021  4
4        1/1/2021  12
I want to de-aggregate this data based on the count, so that, for example, user_id 1 will have four records on 1/1/2021, without the count column. After that, I want to concatenate a random time to the date. My output would look like this:
user_id  date_time
1        1/1/2021 16:00:21
1        1/1/2021 7:23:55
1        1/1/2021 12:01:45
1        1/1/2021 21:21:07
I'm using pandas for this. Randomizing the timestamps is straightforward, I think; it's de-aggregating the data based on a column that is a little tricky for me.
You can repeat each row according to its count (via the index) and add a random time between 0 and 24 hours:
import numpy as np
import pandas as pd

(df.loc[df.index.repeat(df['count'])]                  # one row per unit of count
   .assign(date=lambda d: pd.to_datetime(d['date'])
           + pd.to_timedelta(np.random.randint(0, 24*3600, size=len(d)), unit='s'))
   .rename(columns={'date': 'date_time'})
   .drop('count', axis=1)
)
output:
user_id           date_time
0 1 2021-01-01 03:32:40
0 1 2021-01-01 03:54:18
0 1 2021-01-01 00:57:49
0 1 2021-01-01 13:04:08
1 2 2021-01-01 00:34:03
1 2 2021-01-01 00:14:17
1 2 2021-01-01 03:57:20
1 2 2021-01-01 22:01:11
1 2 2021-01-01 22:09:55
1 2 2021-01-01 13:15:36
1 2 2021-01-01 12:26:39
2 1 2021-01-02 22:51:17
2 1 2021-01-02 13:44:12
2 1 2021-01-02 01:39:14
3 3 2021-01-02 09:22:16
3 3 2021-01-02 03:34:15
3 3 2021-01-02 23:05:49
3 3 2021-01-02 02:21:35
3 3 2021-01-02 19:51:41
3 3 2021-01-02 16:02:20
3 3 2021-01-02 18:14:05
3 3 2021-01-02 09:07:14
3 3 2021-01-02 22:43:44
3 3 2021-01-02 20:48:15
4 2 2021-01-03 19:25:04
4 2 2021-01-03 14:08:03
4 2 2021-01-03 21:23:58
4 2 2021-01-03 17:24:58
5 4 2021-01-01 23:37:41
5 4 2021-01-01 06:06:17
5 4 2021-01-01 19:23:29
5 4 2021-01-01 02:12:50
5 4 2021-01-01 08:09:59
5 4 2021-01-01 03:49:30
5 4 2021-01-01 08:00:42
5 4 2021-01-01 08:03:34
5 4 2021-01-01 15:36:12
5 4 2021-01-01 14:50:43
5 4 2021-01-01 14:54:04
5 4 2021-01-01 14:58:08
Let's say I have the following data:
import pandas as pd

csv = [
    ['2019-05-01 00:00', ],
    ['2019-05-01 01:00', 2],
    ['2019-05-01 02:00', 4],
    ['2019-05-01 03:00', ],
    ['2019-05-01 04:00', 2],
    ['2019-05-01 05:00', 4],
    ['2019-05-01 06:00', 6],
    ['2019-05-01 07:00', ],
    ['2019-05-01 08:00', ],
    ['2019-05-01 09:00', 2],
]
df = pd.DataFrame(csv, columns=["DateTime", "Value"])
So I am working with a time series with gaps in data:
DateTime Value
0 2019-05-01 00:00 NaN
1 2019-05-01 01:00 2.0
2 2019-05-01 02:00 4.0
3 2019-05-01 03:00 NaN
4 2019-05-01 04:00 2.0
5 2019-05-01 05:00 4.0
6 2019-05-01 06:00 6.0
7 2019-05-01 07:00 NaN
8 2019-05-01 08:00 NaN
9 2019-05-01 09:00 2.0
Now, I want to work one by one with each chunk of existing data. That is, I want to split the series into the contiguous pieces between NaNs. The goal is to iterate over these chunks so I can pass each one individually to another function that can't handle gaps in the data. Then I want to store the result back in the original dataframe in its corresponding place. As a trivial example, let's say the function calculates the average value of the chunk. Expected result:
DateTime Value ChunkAverage
0 2019-05-01 00:00 NaN NaN
1 2019-05-01 01:00 2.0 3.0
2 2019-05-01 02:00 4.0 3.0
3 2019-05-01 03:00 NaN NaN
4 2019-05-01 04:00 2.0 4.0
5 2019-05-01 05:00 4.0 4.0
6 2019-05-01 06:00 6.0 4.0
7 2019-05-01 07:00 NaN NaN
8 2019-05-01 08:00 NaN NaN
9 2019-05-01 09:00 2.0 2.0
I know this can be done in a "traditional way" with loops, "if" clauses, index slicing, etc., but I guess there is something more efficient and safer built into pandas. I just can't figure out what.
You can use df.groupby with a grouping key built from pd.Series.isna and pd.Series.cumsum:
# each NaN starts a new chunk; rows within a chunk share the same cumsum value
g = df.Value.isna().cumsum()

df.assign(chunk=df.Value.groupby(g).transform('mean').mask(df.Value.isna()))
# df['chunk'] = df.Value.groupby(g).transform('mean').mask(df.Value.isna())
# df['chunk'] = df.Value.groupby(g).transform('mean').where(df.Value.notna())
DateTime Value chunk
0 2019-05-01 00:00 NaN NaN
1 2019-05-01 01:00 2.0 3.0
2 2019-05-01 02:00 4.0 3.0
3 2019-05-01 03:00 NaN NaN
4 2019-05-01 04:00 2.0 4.0
5 2019-05-01 05:00 4.0 4.0
6 2019-05-01 06:00 6.0 4.0
7 2019-05-01 07:00 NaN NaN
8 2019-05-01 08:00 NaN NaN
9 2019-05-01 09:00 2.0 2.0
Note:
df.assign(...) returns a new dataframe.
df['chunk'] = ... mutates the original dataframe in place.
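If the real function cannot be expressed as a groupby aggregation, the same grouping key can drive an explicit loop over the gap-free chunks. A sketch, assuming the g from above and a placeholder func that accepts a Series without NaNs:
import pandas as pd

def func(chunk):
    # stand-in for the real gap-intolerant function; here it just returns the chunk mean
    return chunk.mean()

result = pd.Series(index=df.index, dtype=float)
for _, chunk in df.Value.dropna().groupby(g):
    # chunk is one contiguous run of non-NaN values
    result.loc[chunk.index] = func(chunk)
df['ChunkAverage'] = result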
One possibility would be to add a separator column, based on the NaNs in Value, and group by that:
df['separator'] = df['Value'].isna().cumsum()
grp = df.groupby('separator').agg(n_values=pd.NamedAgg(column='Value', aggfunc='count'))
print(grp)
This counts the non-NaN values in each group:
           n_values
separator
1                 2
2                 3
3                 0
4                 1
How you want to fill the NaNs depends a bit on what you want to achieve with the calculation.
I have a dataframe like this:
ID 00:00 01:00 02:00 ... 23:00 avg_value
22 4.7 5.3 6 ... 8 5.5
37 0 9.2 4.5 ... 11.2 9.2
4469 2 9.8 11 ... 2 6.4
Can I use np.where to apply conditions on multiple columns at once?
I want to update the values from 00:00 to 23:00 to 0 and 1. If the value at the time of day is greater than avg_value then I change it to 1, else to 0.
I know how to apply this method to a single column:
np.where(df['00:00']>df['avg_value'],1,0)
Can I apply it to multiple columns?
The output would look like:
ID 00:00 01:00 02:00 ... 23:00 avg_value
22 0 1 1 ... 1 5.5
37 0 0 0 ... 1 9.2
4469 0 1 1 ... 0 6.4
Select all columns except the last with DataFrame.iloc, compare with DataFrame.gt, cast to integers, and finally add the avg_value column back with DataFrame.join:
df = df.iloc[:, :-1].gt(df['avg_value'], axis=0).astype(int).join(df['avg_value'])
print (df)
00:00 01:00 02:00 23:00 avg_value
ID
22 0 0 1 1 5.5
37 0 0 0 1 9.2
4469 0 1 1 0 6.4
Or use DataFrame.pop to extract the column:
s = df.pop('avg_value')
df = df.gt(s, axis=0).astype(int).join(s)
print (df)
00:00 01:00 02:00 23:00 avg_value
ID
22 0 0 1 1 5.5
37 0 0 0 1 9.2
4469 0 1 1 0 6.4
This is preferred because assigning back to the same columns converts the integers to floats (a bug):
df.iloc[:, :-1] = df.iloc[:, :-1].gt(df['avg_value'], axis=0).astype(int)
print (df)
00:00 01:00 02:00 23:00 avg_value
ID
22 0.0 0.0 1.0 1.0 5.5
37 0.0 0.0 0.0 1.0 9.2
4469 0.0 1.0 1.0 0.0 6.4
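To answer the literal question about np.where: it can be applied to all hour columns at once by comparing the underlying 2D array against avg_value broadcast row-wise. A sketch, assuming the original df with ID as the index and avg_value as the last column, and building a new frame to sidestep the float conversion shown above:
import numpy as np
import pandas as pd

hour_cols = df.columns[:-1]  # every column except avg_value
flags = np.where(df[hour_cols].to_numpy() > df['avg_value'].to_numpy()[:, None], 1, 0)

out = pd.DataFrame(flags, index=df.index, columns=hour_cols).join(df['avg_value'])
print(out)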