I want to resample a data frame that has time series data at a 30 second interval down to a 1 second interval. For this I used:
test_data=test_data.groupby('entity_id').resample('S', fill_method='ffill')
The output is:
<pandas.core.resample.DatetimeIndexResamplerGroupby object at 0x1a1f64f588>
How can I convert this object to a data frame?
I have tried:
test_data = pd.DataFrame(test_data)
after running the last command, but it returns a data frame whose rows contain the index plus a list of all the other elements of that row.
Use the ffill method instead of the deprecated fill_method argument:
import pandas as pd

test_data = pd.DataFrame({
    'entity_id': ['a','a','a','a','b','b','b','c','d'],
    'data': range(9)},
    index=pd.date_range('2018-01-01', periods=9, freq='3S'))
print (test_data)
entity_id data
2018-01-01 00:00:00 a 0
2018-01-01 00:00:03 a 1
2018-01-01 00:00:06 a 2
2018-01-01 00:00:09 a 3
2018-01-01 00:00:12 b 4
2018-01-01 00:00:15 b 5
2018-01-01 00:00:18 b 6
2018-01-01 00:00:21 c 7
2018-01-01 00:00:24 d 8
test_data=test_data.groupby('entity_id')['data'].resample('S').ffill()
print (test_data)
entity_id
a 2018-01-01 00:00:00 0
2018-01-01 00:00:01 0
2018-01-01 00:00:02 0
2018-01-01 00:00:03 1
2018-01-01 00:00:04 1
2018-01-01 00:00:05 1
2018-01-01 00:00:06 2
2018-01-01 00:00:07 2
2018-01-01 00:00:08 2
2018-01-01 00:00:09 3
b 2018-01-01 00:00:12 4
2018-01-01 00:00:13 4
2018-01-01 00:00:14 4
2018-01-01 00:00:15 5
2018-01-01 00:00:16 5
2018-01-01 00:00:17 5
2018-01-01 00:00:18 6
c 2018-01-01 00:00:21 7
d 2018-01-01 00:00:24 8
Name: data, dtype: int64
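If you then need a regular data frame again (the original goal), reset_index turns the resulting MultiIndex Series back into columns; a minimal sketch continuing from the code above (the second index level is unnamed here, so it comes out as level_1):
test_data = test_data.reset_index()
test_data = test_data.rename(columns={'level_1': 'datetime'})
print(test_data.head())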
Related
I have a dataset with 15-minute observations for different stations over 20 years. I want to know the time range over which each station has data.
station_id           start_time             end_time  observation
         2  2000-01-02 01:00:00  2000-01-02 01:15:00           50
         2  2000-01-02 01:15:00  2000-01-02 01:30:00           15
         2  2000-02-02 01:30:00  2000-01-02 01:45:00            3
         3  2000-01-02 05:00:00  2000-01-02 05:15:00           10
         3  2000-01-02 05:15:00  2000-01-02 05:30:00            2
         3  2000-02-03 01:00:00  2000-01-02 01:15:00           15
         3  2000-02-04 01:00:00  2000-01-02 01:15:00           20
An example of what I want to have:
|station_id | start | end | years |days
| 2 |2000-01-02 01:00:00|2000-01-02 01:45:00| 1 | 1
| 3 |2000-01-02 05:00:00|2000-01-02 01:15:00| 1 | 1
Try using groupby, diff, abs, agg and assign:
df[['start_time', 'end_time']] = df[['start_time', 'end_time']].apply(pd.to_datetime)
# first start and last end per station
x = df.groupby('station_id').agg({'start_time': 'first', 'end_time': 'last'})
# absolute time span between those two timestamps
temp = x.diff(axis=1).abs()['end_time']
# split the span into whole years and leftover days
x = x.assign(years=temp.dt.days // 365, days=temp.dt.days % 365).reset_index()
print(x)
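Equivalently, the span can be computed by subtracting the two columns directly, which some readers find easier to follow than diff(axis=1); a small sketch of that variant, replacing the temp line above:
# same Timedelta per station as the diff/abs line
temp = (x['end_time'] - x['start_time']).abs()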
I have a time series dataset that can be created with the following code.
idx = pd.date_range("2018-01-01", periods=100, freq="H")
ts = pd.Series(idx)
dft = pd.DataFrame(ts, columns=["date"])
dft["data"] = ""
# .loc slices on the default RangeIndex are label-inclusive,
# so loc[0:4] covers the first five rows (avoids chained-assignment warnings)
dft.loc[0:4, "data"] = "a"
dft.loc[5:14, "data"] = "b"
dft.loc[15:19, "data"] = "c"
dft.loc[20:29, "data"] = "d"
dft.loc[30:39, "data"] = "a"
dft.loc[40:69, "data"] = "c"
dft.loc[70:84, "data"] = "b"
dft.loc[85:, "data"] = "c"
In the data column, the unique values are a, b, c and d. These values repeat in runs over different time windows, and I want to capture the first and last timestamp of each window. How can I do that?
Compute a grouper for your changing values using shift to compare consecutive rows, then use groupby+agg to get the min/max per group:
# increments every time the value in `data` changes, giving one id per run
group = dft.data.ne(dft.data.shift()).cumsum()
dft.groupby(group)['date'].agg(['min', 'max'])
output:
min max
data
1 2018-01-01 00:00:00 2018-01-01 04:00:00
2 2018-01-01 05:00:00 2018-01-01 14:00:00
3 2018-01-01 15:00:00 2018-01-01 19:00:00
4 2018-01-01 20:00:00 2018-01-02 05:00:00
5 2018-01-02 06:00:00 2018-01-02 15:00:00
6 2018-01-02 16:00:00 2018-01-03 21:00:00
7 2018-01-03 22:00:00 2018-01-04 12:00:00
8 2018-01-04 13:00:00 2018-01-05 03:00:00
Edit: combining with the original data:
dft.groupby(group).agg({'data': 'first', 'date': ['min', 'max']})
output:
data date
first min max
data
1 a 2018-01-01 00:00:00 2018-01-01 04:00:00
2 b 2018-01-01 05:00:00 2018-01-01 14:00:00
3 c 2018-01-01 15:00:00 2018-01-01 19:00:00
4 d 2018-01-01 20:00:00 2018-01-02 05:00:00
5 a 2018-01-02 06:00:00 2018-01-02 15:00:00
6 c 2018-01-02 16:00:00 2018-01-03 21:00:00
7 b 2018-01-03 22:00:00 2018-01-04 12:00:00
8 c 2018-01-04 13:00:00 2018-01-05 03:00:00
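If you prefer flat column names instead of the two-level header, you can rename them afterwards; a short follow-up sketch (the names value, start and end are just a suggestion):
out = dft.groupby(group).agg({'data': 'first', 'date': ['min', 'max']})
out.columns = ['value', 'start', 'end']  # flatten the MultiIndex columns
out = out.reset_index(drop=True)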
Sample Data
import numpy as np
import pandas as pd

date_rng = pd.date_range('2019-01-01', freq='s', periods=400)
df = pd.DataFrame(np.random.lognormal(.005, .5, size=len(date_rng)),
                  columns=['data1'],
                  index=date_rng)
Example input
data1
2019-01-01 00:00:00 1
2019-01-01 00:00:01 -2
2019-01-01 00:00:02 2
2019-01-01 00:00:03 3
2019-01-01 00:00:04 1
2019-01-01 00:00:05 2
2019-01-01 00:00:06 -1
2019-01-01 00:00:07 3
2019-01-01 00:00:08 4
2019-01-01 00:00:09 5
2019-01-01 00:00:10 7
2019-01-01 00:00:11 2
2019-01-01 00:00:12 4
2019-01-01 00:00:13 -1
2019-01-01 00:00:14 5
2019-01-01 00:00:15 3
2019-01-01 00:00:16 5
2019-01-01 00:00:17 -3
... ...
Expected output
data1 cumsum
2019-01-01 00:00:00 1 1
2019-01-01 00:00:01 -2 -1
2019-01-01 00:00:02 2 1
2019-01-01 00:00:03 3 4
2019-01-01 00:00:04 1 5 (reset cumsum at this point)
2019-01-01 00:00:05 2 2
2019-01-01 00:00:06 -1 1
2019-01-01 00:00:07 3 4
2019-01-01 00:00:08 4 8 (reset at this point)
2019-01-01 00:00:09 5 5 (reset at this point)
2019-01-01 00:00:10 7 7 (reset at this point)
2019-01-01 00:00:11 2 2
2019-01-01 00:00:12 4 6 (reset at this point)
2019-01-01 00:00:13 -1 -1
2019-01-01 00:00:14 5 4
2019-01-01 00:00:15 3 7 (reset at this point)
2019-01-01 00:00:16 1 1
2019-01-01 00:00:17 -3 -2
... ...
I want to calculate the cumulative sum of data1, with the condition that it resets once it is larger than 5 and exceeds 20% of the value at the last reset. For the first reset, only the larger-than-5 condition applies; after that, both conditions apply.
I have checked other answers on Stack Overflow but did not find a similar problem. Please advise me on how to solve this.
You can use a simple loop with an if condition and a bit of state to solve this problem:
ls = []  # store the running cumulative sums
cumsum = 0
last_reset = 0
for _, row in df.iterrows():
    cumsum = cumsum + row.data1
    ls.append(cumsum)
    # reset once the sum reaches 5 and exceeds 20% of the last reset value
    if cumsum >= 5 and cumsum > 0.2 * last_reset:
        last_reset = cumsum
        cumsum = 0
df['cumsum'] = ls
df
data1 cumsum
2019-01-01 00:00:00 1 1
2019-01-01 00:00:01 -2 -1
2019-01-01 00:00:02 2 1
2019-01-01 00:00:03 3 4
2019-01-01 00:00:04 1 5
2019-01-01 00:00:05 2 2
2019-01-01 00:00:06 -1 1
2019-01-01 00:00:07 3 4
2019-01-01 00:00:08 4 8
2019-01-01 00:00:09 5 5
2019-01-01 00:00:10 7 7
2019-01-01 00:00:11 2 2
2019-01-01 00:00:12 4 6
2019-01-01 00:00:13 -1 -1
2019-01-01 00:00:14 5 4
2019-01-01 00:00:15 3 7
2019-01-01 00:00:16 1 1
2019-01-01 00:00:17 -3 -2
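Because each reset depends on everything accumulated since the previous reset, the calculation is hard to vectorize; if speed matters, one option is to keep the same logic but loop over the underlying NumPy array instead of iterrows. A sketch under the same assumptions as the loop above:
import numpy as np

def reset_cumsum(values, threshold=5, frac=0.2):
    # cumulative sum that restarts after reaching `threshold`
    # and exceeding `frac` of the value at the last reset
    out = np.empty(len(values))
    cumsum = 0.0
    last_reset = 0.0
    for i, v in enumerate(values):
        cumsum += v
        out[i] = cumsum
        if cumsum >= threshold and cumsum > frac * last_reset:
            last_reset = cumsum
            cumsum = 0.0
    return out

df['cumsum'] = reset_cumsum(df['data1'].to_numpy())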
I created a dataframe with only a datetime column, at a 1 second interval for January 1, 2018, as shown in the code below.
i = pd.date_range(start='2018-01-01 00:00:00', end='2018-01-01 23:59:00', freq="1S")
ts = pd.DataFrame(index=i)
ts = ts.reset_index()
ts = ts.rename(columns={'index': 'datetime'})
df1:
datetime
0 2018-01-01 00:00:00
1 2018-01-01 00:00:01
2 2018-01-01 00:00:02
3 2018-01-01 00:00:03
4 2018-01-01 00:00:04
5 2018-01-01 00:00:05
6 2018-01-01 00:00:06
7 2018-01-01 00:00:07
8 2018-01-01 00:00:08
9 2018-01-01 00:00:09
10 2018-01-01 00:00:10
11 2018-01-01 00:00:11
12 2018-01-01 00:00:12
13 2018-01-01 00:00:13
14 2018-01-01 00:00:14
15 2018-01-01 00:00:15
16 2018-01-01 00:00:16
17 2018-01-01 00:00:17
18 2018-01-01 00:00:18
19 2018-01-01 00:00:19
20 2018-01-01 00:00:20
21 2018-01-01 00:00:21
22 2018-01-01 00:00:22
23 2018-01-01 00:00:23
24 2018-01-01 00:00:24
25 2018-01-01 00:00:25
26 2018-01-01 00:00:26
27 2018-01-01 00:00:27
28 2018-01-01 00:00:28
29 2018-01-01 00:00:29
I have another dataframe with a datetime column and other columns:
df2:
datetime a b c d e
0 2018-01-01 00:00:04 0.9
1 2018-01-01 00:00:06 0.6 0.7
2 2018-01-01 00:00:09 0.5 0.7 0.8
3 2018-01-01 00:00:16 2.3 3.6 4.9 5.0
4 2018-01-01 00:00:17 0.9 3.5 5.5
5 2018-01-01 00:00:23 0.1 0.6 0.0 1.7
6 2018-01-01 00:00:29 2.7 5.5 4.3
Now I am trying to map the datetime columns of df1 and df2 using a pandas outer join, and I would like my expected result to look like this:
datetime a b c d e
0 2018-01-01 00:00:00
1 2018-01-01 00:00:01
2 2018-01-01 00:00:02
3 2018-01-01 00:00:03
4 2018-01-01 00:00:04 0.9
5 2018-01-01 00:00:05
6 2018-01-01 00:00:06 0.6 0.7
7 2018-01-01 00:00:07
8 2018-01-01 00:00:08
9 2018-01-01 00:00:09 0.5 0.7 0.8
10 2018-01-01 00:00:10
11 2018-01-01 00:00:11
12 2018-01-01 00:00:12
13 2018-01-01 00:00:13
14 2018-01-01 00:00:14
15 2018-01-01 00:00:15
16 2018-01-01 00:00:16 2.3 3.6 4.9 5.0
17 2018-01-01 00:00:17 0.9 3.5 5.5
18 2018-01-01 00:00:18
19 2018-01-01 00:00:19
20 2018-01-01 00:00:20
21 2018-01-01 00:00:21
22 2018-01-01 00:00:22
23 2018-01-01 00:00:23 0.1 0.6 0.0 1.7
24 2018-01-01 00:00:24
25 2018-01-01 00:00:25
26 2018-01-01 00:00:26
27 2018-01-01 00:00:27
28 2018-01-01 00:00:28
29 2018-01-01 00:00:29 2.7 5.5 4.3
but my output looks like this
datetime a b c d e
0 2018-01-01 00:00:00
1 2018-01-01 00:00:01
2 2018-01-01 00:00:02
3 2018-01-01 00:00:03
4 2018-01-01 00:00:04
5 2018-01-01 00:00:05
6 2018-01-01 00:00:06
7 2018-01-01 00:00:07
8 2018-01-01 00:00:08
9 2018-01-01 00:00:09
10 2018-01-01 00:00:10
11 2018-01-01 00:00:11
12 2018-01-01 00:00:12
13 2018-01-01 00:00:13
14 2018-01-01 00:00:14
15 2018-01-01 00:00:15
16 2018-01-01 00:00:16
17 2018-01-01 00:00:17
18 2018-01-01 00:00:18
19 2018-01-01 00:00:19
20 2018-01-01 00:00:20
21 2018-01-01 00:00:21
22 2018-01-01 00:00:22
23 2018-01-01 00:00:23
24 2018-01-01 00:00:24
25 2018-01-01 00:00:25
26 2018-01-01 00:00:26
27 2018-01-01 00:00:27
28 2018-01-01 00:00:28
29 2018-01-01 00:00:29
30 2018-01-01 00:00:04 0.9
31 2018-01-01 00:00:06 0.6 0.7
32 2018-01-01 00:00:09 0.5 0.7 0.8
33 2018-01-01 00:00:16 2.3 3.6 4.9 5.0
34 2018-01-01 00:00:17 0.9 3.5 5.5
35 2018-01-01 00:00:23 0.1 0.6 0.0 1.7
36 2018-01-01 00:00:29 2.7 5.5 4.3
The code I am using to do that operation is:
test = pandas.merge(df1, df2, on = ['datetime'], how= 'outer')
I am not quite sure how to approach this issue and would appreciate some help.
Keep ts with a datetime index and use reindex, as @Scott Boston mentioned in the comments:
i = pd.date_range(start='2018-01-01 00:00:00', end='2018-01-01 23:59:00', freq="1S")
ts = pd.DataFrame(index=i)
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime').reindex(ts.index)
a b c d e
2018-01-01 00:00:00 NaN NaN NaN NaN NaN
2018-01-01 00:00:01 NaN NaN NaN NaN NaN
2018-01-01 00:00:02 NaN NaN NaN NaN NaN
2018-01-01 00:00:03 NaN NaN NaN NaN NaN
2018-01-01 00:00:04 0.9
2018-01-01 00:00:05 NaN NaN NaN NaN NaN
2018-01-01 00:00:06 0.6 0.7
2018-01-01 00:00:07 NaN NaN NaN NaN NaN
2018-01-01 00:00:08 NaN NaN NaN NaN NaN
2018-01-01 00:00:09 0.5 0.7 0.8
2018-01-01 00:00:10 NaN NaN NaN NaN NaN
2018-01-01 00:00:11 NaN NaN NaN NaN NaN
2018-01-01 00:00:12 NaN NaN NaN NaN NaN
2018-01-01 00:00:13 NaN NaN NaN NaN NaN
2018-01-01 00:00:14 NaN NaN NaN NaN NaN
2018-01-01 00:00:15 NaN NaN NaN NaN NaN
2018-01-01 00:00:16 2.3 3.6 4.9 5.0
2018-01-01 00:00:17 0.9 3.5 5.5
Option 2: concat
pd.concat([ts, df.set_index('datetime')], axis = 1)
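As a side note, the outer merge from the question should also line up once both datetime columns have the same datetime64 dtype; duplicated rows like those shown usually mean one side still holds strings. A minimal sketch of that route, assuming df1 and df2 as in the question:
df1['datetime'] = pd.to_datetime(df1['datetime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])
test = pd.merge(df1, df2, on='datetime', how='outer').sort_values('datetime')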
I have this information, where "opid" is categorical:
datetime id nut opid user amount
2018-01-01 07:01:00 1531 3hrnd 1 mherrera 1
2018-01-01 07:05:00 9510 sd45f 1 svasqu 1
2018-01-01 07:06:00 8125 5s8fr 15 urubi 1
2018-01-01 07:08:15 6324 sd5d6 1 jgonza 1
2018-01-01 07:12:01 0198 tgfg5 1 julmaf 1
2018-01-01 07:13:50 6589 mbkg4 15 jdjiep 1
2018-01-01 07:16:10 9501 wurf4 15 polga 1
The result I'm looking for is something like this:
datetime opid amount
2018-01-01 07:00:00 1 3
2018-01-01 07:00:00 15 1
2018-01-01 07:10:00 1 1
2018-01-01 07:10:00 15 2
So basically I need to know how many of each "opid" occur every 10 minutes.
P.S. "amount" is always 1 and "opid" ranges from 1 to 15.
Using pd.Grouper:
df.set_index('datetime').groupby(['opid', pd.Grouper(freq='10min')]).amount.sum()
opid datetime
1 2018-01-01 07:00:00 3
2018-01-01 07:10:00 1
15 2018-01-01 07:00:00 1
2018-01-01 07:10:00 2
Name: amount, dtype: int64
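To get the flat layout shown in the question (datetime, opid and amount as columns), reset_index on the result; a short follow-up:
out = (df.set_index('datetime')
         .groupby(['opid', pd.Grouper(freq='10min')])
         .amount.sum()
         .reset_index())
out = out[['datetime', 'opid', 'amount']]  # reorder to match the requested layout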