Break down a data frame into a second-by-second time series - python

I have this dataset of active subjects during specified time-periods.
start end name
0 00:00 00:10 a
1 00:10 00:20 b
2 00:00 00:20 c
3 00:00 00:10 d
4 00:10 00:15 e
5 00:15 00:20 a
The intervals are inclusive on the left (start) side and not inclusive on the right (end).
There are always three subjects active. I want to increase the granularity of the data, so that I have info on the three active subjects for each second. Each second has three unique values.
This would be the desired result for the test case.
slot1 slot2 slot3
0 a c d
1 a c d
2 a c d
3 a c d
4 a c d
5 a c d
6 a c d
7 a c d
8 a c d
9 a c d
10 b c e
11 b c e
12 b c e
13 b c e
14 b c e
15 b c a
16 b c a
17 b c a
18 b c a
19 b c a
The order of the subjects inside the slots is irrelevant for now. Subjects can reappear in the data, as "a" does from 00:00 to 00:10 and then again from 00:15 to 00:20. The intervals can start or end at any second.

Route 1: One (costly but easy) way is to explode the data to seconds, then merge the exploded frame with itself to line up the three concurrent subjects:
import numpy as np
import pandas as pd

# df is the frame from the question (start/end as "mm:ss" strings)
time_df = (('00:' + df[['start','end']])                            # prepend hours -> "00:mm:ss"
           .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
           .astype(int)
           .apply(lambda x: np.arange(*x), axis=1)                  # one array of seconds per row
           .to_frame('time')
           .assign(slot=df['name'])
           .explode('time')                                         # one row per (second, subject)
)
(time_df.merge(time_df, on='time', suffixes=['1','2'])
        .query('slot1 < slot2')
        .merge(time_df, on='time')
        .query('slot2 < slot')
)
Output:
time slot1 slot2 slot
2 0 a c d
11 1 a c d
20 2 a c d
29 3 a c d
38 4 a c d
47 5 a c d
56 6 a c d
65 7 a c d
74 8 a c d
83 9 a c d
92 10 b c e
101 11 b c e
110 12 b c e
119 13 b c e
128 14 b c e
139 15 a b c
148 16 a b c
157 17 a b c
166 18 a b c
175 19 a b c
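If you want the exact layout from the question, a small finishing touch (a sketch, assuming the merged expression above is saved as out):
result = (out.rename(columns={'slot': 'slot3'})   # third copy kept its original column name
             .set_index('time')
             .sort_index())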
Route 2: Another way is to cross merge then query the overlapping intervals:
df[['start','end']] = (('00:' + df[['start','end']])
                       .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
                       .astype(int)
                       )
(df.merge(df, how='cross')
   .assign(start=lambda x: x.filter(like='start').max(axis=1),
           end=lambda x: x.filter(like='end').min(axis=1))
   .query('start < end & name_x < name_y')
   [['name_x','name_y','start','end']]
   .merge(df, how='cross')
   .assign(start=lambda x: x.filter(like='start').max(axis=1),
           end=lambda x: x.filter(like='end').min(axis=1))
   .query('start < end & name_y < name')
   [['start','end', 'name_x','name_y', 'name']]
)
Output:
start end name_x name_y name
3 0 10 a c d
16 10 15 b c e
38 15 20 a b c
As you can see, this output is the same as the other, just kept in the original interval form. Depending on your data, one route might be better than the other.
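If you do need the per-second view from this interval form, the same explode trick works (a sketch, assuming the chained result above is saved as overlaps and start/end are the integer seconds computed above):
expanded = (overlaps
            .assign(time=lambda d: d.apply(lambda r: np.arange(r['start'], r['end']), axis=1))
            .explode('time')
            .rename(columns={'name_x': 'slot1', 'name_y': 'slot2', 'name': 'slot3'})
            .set_index('time')[['slot1', 'slot2', 'slot3']]
            .sort_index())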
Update: Since your data has exactly 3 slots at any time, you can easily do this with pivot. This is the best solution.
# time_df as in Route 1
(time_df.sort_values(['time','slot'])
        .assign(nums=lambda x: np.arange(len(x)) % 3)
        .pivot(index='time', columns='nums', values='slot')
)
# in general, `.assign(nums=lambda x: x.groupby('time').cumcount())`
# also works instead of the sort + arange above
Output:
nums 0 1 2
time
0 a c d
1 a c d
2 a c d
3 a c d
4 a c d
5 a c d
6 a c d
7 a c d
8 a c d
9 a c d
10 b c e
11 b c e
12 b c e
13 b c e
14 b c e
15 a b c
16 a b c
17 a b c
18 a b c
19 a b c
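The groupby/cumcount variant mentioned in the comment above, written out in full (no sorting needed, since the column position within each second is arbitrary):
(time_df
    .assign(nums=lambda x: x.groupby('time').cumcount())
    .pivot(index='time', columns='nums', values='slot')
)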

This solution uses piso (pandas interval set operations) and will run fast.
setup
Create data and convert to pandas.Timedelta
import pandas as pd
import piso

df = pd.DataFrame(
    {
        "start": ["00:00", "00:10", "00:00", "00:00", "00:10", "00:15"],
        "end": ["00:10", "00:20", "00:20", "00:10", "00:15", "00:20"],
        "name": ["a", "b", "c", "d", "e", "a"],
    }
)
df[["start", "end"]] = ("00:" + df[["start", "end"]].astype(str)).apply(pd.to_timedelta)
Create the sample times (a pandas.TimedeltaIndex of seconds):
sample_times = pd.timedelta_range(df["start"].min(), df["end"].max(), freq="s")
solution
For each possible value of "name", create a pandas.IntervalIndex which holds the intervals defined by the start and end columns:
ii_series = df.groupby("name").apply(
    lambda d: pd.IntervalIndex.from_arrays(d["start"], d["end"], closed="left")
)
ii_series looks like this:
name
a IntervalIndex([[0 days 00:00:00, 0 days 00:10:...
b IntervalIndex([[0 days 00:10:00, 0 days 00:20:...
c IntervalIndex([[0 days 00:00:00, 0 days 00:20:...
d IntervalIndex([[0 days 00:00:00, 0 days 00:10:...
e IntervalIndex([[0 days 00:10:00, 0 days 00:15:...
dtype: object
Then to each of these interval indexes we apply the piso.contains function, which can be used to test whether a set of points is contained in an interval:
contained = ii_series.apply(piso.contains, x=sample_times, result="points")
contained will be a dataframe indexed by the names, whose columns are the sample times. The transpose of this looks like:
a b c d e
0 days 00:00:00 True False True True False
0 days 00:00:01 True False True True False
0 days 00:00:02 True False True True False
0 days 00:00:03 True False True True False
0 days 00:00:04 True False True True False
... ... ... ... ... ...
0 days 00:19:56 True True True False False
0 days 00:19:57 True True True False False
0 days 00:19:58 True True True False False
0 days 00:19:59 True True True False False
0 days 00:20:00 False False False False False
This format may be easier to work with, depending on the application. But if you want it in the format stated in the question, you can create a series of lists indexed by each second:
series_of_lists = (
    contained.transpose()
    .melt(ignore_index=False)
    .query("value == True")
    .reset_index()
    .groupby("index")["name"]
    .apply(pd.Series.to_list)
)
Then convert to dataframe:
pd.DataFrame(series_of_lists.to_list(), index=series_of_lists.index)
which will look like this:
0 1 2
index
0 days 00:00:00 a c d
0 days 00:00:01 a c d
0 days 00:00:02 a c d
0 days 00:00:03 a c d
0 days 00:00:04 a c d
... .. .. ..
0 days 00:19:55 a b c
0 days 00:19:56 a b c
0 days 00:19:57 a b c
0 days 00:19:58 a b c
0 days 00:19:59 a b c
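To match the layout in the question more closely, the timedelta index can be converted to whole seconds and the columns renamed (a cosmetic sketch):
result = pd.DataFrame(series_of_lists.to_list(), index=series_of_lists.index)
result.index = (result.index / pd.Timedelta("1s")).astype(int)   # timedeltas -> integer seconds
result.columns = ["slot1", "slot2", "slot3"]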
Note: I am the creator of piso, feel free to reach out if you have any questions.

Related

Python - Count duplicate user ID occurrences in a given month

If I create a DataFrame from
df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26','2022-08-30','2022-09-3','2022-09-8','2022-09-13'],
                   "id": ['A','B','C','D','E','B','A','F','G','F','H']})
df['date'] = pd.to_datetime(df['date'])
(Table 1 below shows the data.)
I am interested in counting how many times an ID appears in a given month. For example, in a given month A, B and F all occur twice whilst everything else occurs once. The difficulty with this data is that the dates are not evenly spread out.
I attempted to resample on date by month, with the hope of counting duplicates.
df.resample('M', on='date')['id']
But all the functions that can be used on resample just give me the number of unique occurrences rather than how many times each ID occurred.
A rough example of the output is below [Table 2]
All of the examples I have seen merely count how many total or unique occurrences there are in a given month; this question is focused on finding out how many occurrences each ID had in a month.
Thank you for your time.
[Table 1] - Data
idx  date        id
0    2022-08-10  A
1    2022-08-18  B
2    2022-08-18  C
3    2022-08-20  D
4    2022-08-20  E
5    2022-08-24  B
6    2022-08-26  A
7    2022-08-30  F
8    2022-09-03  G
9    2022-09-08  F
10   2022-09-13  H
[Table 2] - Rough example of desired output
id  occurrences in a month
A   2
B   2
C   1
D   1
E   1
F   2
G   1
H   1
Use Series.dt.to_period for month periods and count values per id by GroupBy.size, then aggregate sum:
df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
         .size()
         .groupby(level=0)
         .sum()
         .reset_index(name='occurrences in a month'))
print(df1)
id occurrences in a month
0 A 2
1 B 2
2 C 1
3 D 1
4 E 1
5 F 2
6 G 1
7 H 1
Or use Grouper:
df1 = (df.groupby(['id', pd.Grouper(freq='M', key='date')])
         .size()
         .groupby(level=0)
         .sum()
         .reset_index(name='occurrences in a month'))
print(df1)
EDIT:
df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26',
'2022-08-30','2022-09-3','2022-09-8','2022-09-13','2050-12-15'],
"id": ['A','B','C','D','E','B','A','F','G','F','H','H']})
df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d')
print (df)
Because counting first per month (or per day, or per date) and then summing the values is the same as counting per id directly:
df1 = df.groupby('id').size().reset_index(name='occurrences')
print(df1)
id occurrences
0 A 2
1 B 2
2 C 1
3 D 1
4 E 1
5 F 2
6 G 1
7 H 2
The per-month counts sum to the same totals per id:
df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
         .size())
print(df1)
id date
A 2022-08 2
B 2022-08 2
C 2022-08 1
D 2022-08 1
E 2022-08 1
F 2022-08 1
2022-09 1
G 2022-09 1
H 2022-09 1
2050-12 1
dtype: int64
df1 = (df.groupby(['id', df['date'].dt.to_period('d')])
         .size())
print(df1)
id date
A 2022-08-10 1
2022-08-26 1
B 2022-08-18 1
2022-08-24 1
C 2022-08-18 1
D 2022-08-20 1
E 2022-08-20 1
F 2022-08-30 1
2022-09-08 1
G 2022-09-03 1
H 2022-09-13 1
2050-12-15 1
dtype: int64
df1 = (df.groupby(['id', df['date'].dt.day])
         .size())
print(df1)
id date
A 10 1
26 1
B 18 1
24 1
C 18 1
D 20 1
E 20 1
F 8 1
30 1
G 3 1
H 13 1
15 1
dtype: int64
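If you ever need the counts kept separate per month (one column per month) rather than summed, a small sketch with crosstab works too (per_month is my own name):
per_month = pd.crosstab(df['id'], df['date'].dt.to_period('M'))
print(per_month)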

pandas - take N last values from a group

I have a dataframe that looks something like this (the date is in the format dd/mm/yyyy):
Param1 Param2 date value
1 a b 30/10/2007 5
2 a b 31/10/2007 8
3 a b 01/11/2007 9
4 a b 01/12/2007 3
5 a b 02/12/2007 2
6 a b 01/03/2008 11
7 b c 05/10/2008 7
8 b c 06/10/2008 13
9 b c 07/10/2008 19
10 b c 08/11/2008 22
11 b c 09/11/2008 35
12 b c 08/12/2008 5
What I need to do is group by Param1 and Param2 and create N (in my case, 3) additional columns holding the last N previous values that are at least 30 days before the current row.
So the output should look something like that:
Param1 Param2 date value prev_1 prev_2 prev_3
1 a b 30/10/2007 5 None None None
2 a b 31/10/2007 8 None None None
3 a b 01/11/2007 9 None None None
4 a b 01/12/2007 3 9 8 5
5 a b 02/12/2007 2 9 8 5
6 a b 01/03/2008 11 2 3 9
7 b c 05/10/2008 7 None None None
8 b c 06/10/2008 13 None None None
9 b c 07/10/2008 19 None None None
10 b c 08/11/2008 22 19 13 7
11 b c 09/11/2008 35 19 13 7
12 b c 08/12/2008 5 22 19 13
I've tried using set_index, stack and related functions, but I just couldn't figure it out (without an ugly for loop).
Any help will be appreciated!
EDIT: while it is similar to this question: question
It is not exactly the same, because you can't do a simple shift; you need to check the condition of a gap of at least 30 days.
Here is my suggestion:
data.date = pd.to_datetime(data.date, dayfirst=True)
data['ind'] = data.index

def func(a):
    # earlier rows of the same group whose date is at least 30 days before this row
    aa = data[(data.ind < a.ind)
              & (data.Param1 == a.Param1) & (data.Param2 == a.Param2)
              & (data.date <= (a.date - np.timedelta64(30, 'D')))]
    # pad with NaN so there are always at least 3 candidates, most recent first
    aaa = [np.nan] * 3 + list(aa.value.values)
    aaaa = pd.Series(aaa[::-1][:3], index=['prev_1', 'prev_2', 'prev_3'])
    return pd.concat([a, aaaa])

data.apply(func, axis=1).drop('ind', axis=1)
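To try the snippet above, the question's sample frame can be rebuilt like this (values copied from the table in the question):
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Param1': ['a'] * 6 + ['b'] * 6,
    'Param2': ['b'] * 6 + ['c'] * 6,
    'date': ['30/10/2007', '31/10/2007', '01/11/2007', '01/12/2007', '02/12/2007', '01/03/2008',
             '05/10/2008', '06/10/2008', '07/10/2008', '08/11/2008', '09/11/2008', '08/12/2008'],
    'value': [5, 8, 9, 3, 2, 11, 7, 13, 19, 22, 35, 5],
})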

How to replace part of dataframe in pandas

I have a sample dataframe like this
df1=
A B C
a 1 2
b 3 4
b 5 6
c 7 8
d 9 10
I would like to replace a part of this dataframe (the rows where col A is b or c) with this dataframe
df2=
A B C
b 9 10
b 11 12
c 13 14
I would like to get the result below
df3=
A B C
a 1 2
b 9 10
b 11 12
c 13 14
d 9 10
I tried
df1[df1.A.isin("bc")]...
But I couldn't figure out how to do the replacement.
Could someone tell me how to replace part of a dataframe?
As I explained, try update:
import pandas as pd
df1 = pd.DataFrame({"A":['a','b','b','c'], "B":[1,2,4,6], "C":[3,2,1,0]})
df2 = pd.DataFrame({"A":['b','b','c'], "B":[100,400,300], "C":[39,29,100]})
df2 = df2.set_index(df1.loc[df1.A.isin(df2.A), :].index)   # align df2 to the matching rows of df1
df1.update(df2)
Out[75]:
A B C
0 a 1.0 3.0
1 b 100.0 39.0
2 b 400.0 29.0
3 c 300.0 100.0
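As the output shows, update upcasts the numeric columns to float; if you need integers back, cast afterwards (a small follow-up):
df1[['B', 'C']] = df1[['B', 'C']].astype(int)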
You need combine_first or update by column A, but because of the duplicates you need cumcount first:
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1 = df1.set_index(['A','g'])
df2 = df2.set_index(['A','g'])
df3 = df2.combine_first(df1).reset_index(level=1, drop=True).astype(int).reset_index()
print (df3)
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
Another solution:
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1 = df1.set_index(['A','g'])
df2 = df2.set_index(['A','g'])
df1.update(df2)
df1 = df1.reset_index(level=1, drop=True).astype(int).reset_index()
print (df1)
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
If the duplicates of column A in df1 are the same as in df2 and have the same length:
df2.index = df1.index[df1.A.isin(df2.A)]
df3 = df2.combine_first(df1)
print (df3)
A B C
0 a 1.0 2.0
1 b 9.0 10.0
2 b 11.0 12.0
3 c 13.0 14.0
4 d 9.0 10.0
You could solve your problem with the following:
import pandas as pd
df1 = pd.DataFrame({'A':['a','b','b','c','d'],'B':[1,3,5,7,9],'C':[2,4,6,8,10]})
df2 = pd.DataFrame({'A':['b','b','c'],'B':[9,11,13],'C':[10,12,14]})
df2 = df2.set_index(df1.loc[df1.A.isin(df2.A), :].index)   # align df2 to the matching rows of df1
df1.loc[df1.A.isin(df2.A), ['B', 'C']] = df2[['B', 'C']]
Out[108]:
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
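Note that the .loc assignment aligns on the index, which is why df2 is re-indexed to the matching rows of df1 first; without that set_index step the right-hand side would align on df2's default 0..2 index and the mismatched rows would become NaN.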

Merge dataframes on nearest datetime / timestamp

I have two data frames as follows:
A = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"], "date":["06/22/2014","07/02/2014","01/01/2015","01/01/1991","08/02/1999"]})
B = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"], "date":["02/15/2015","06/30/2014","07/02/1999","10/05/1990","06/24/2014"], "value": ["3","5","1","7","8"] })
Which look like the following:
>>> A
ID date
0 A 2014-06-22
1 A 2014-07-02
2 C 2015-01-01
3 B 1991-01-01
4 B 1999-08-02
>>> B
ID date value
0 A 2015-02-15 3
1 A 2014-06-30 5
2 C 1999-07-02 1
3 B 1990-10-05 7
4 B 2014-06-24 8
I want to merge A with the values of B using the nearest date. In this example none of the dates match, but it could be the case that some do.
The output should be something like this:
>>> C
ID date value
0 A 06/22/2014 8
1 A 07/02/2014 5
2 C 01/01/2015 3
3 B 01/01/1991 7
4 B 08/02/1999 1
It seems to me that there should be a native function in pandas that would allow this.
Note: a similar question has been asked here
pandas.merge: match the nearest time stamp >= the series of timestamps
You can use reindex with method='nearest' and then merge:
A['date'] = pd.to_datetime(A.date)
B['date'] = pd.to_datetime(B.date)
A.sort_values('date', inplace=True)
B.sort_values('date', inplace=True)
B1 = B.set_index('date').reindex(A.set_index('date').index, method='nearest').reset_index()
print (B1)
print (pd.merge(A,B1, on='date'))
ID_x date ID_y value
0 B 1991-01-01 B 7
1 B 1999-08-02 C 1
2 A 2014-06-22 B 8
3 A 2014-07-02 A 5
4 C 2015-01-01 A 3
You can also add parameter suffixes:
print (pd.merge(A,B1, on='date', suffixes=('_', '')))
ID_ date ID value
0 B 1991-01-01 B 7
1 B 1999-08-02 C 1
2 A 2014-06-22 B 8
3 A 2014-07-02 A 5
4 C 2015-01-01 A 3
pd.merge_asof(A, B, on="date", direction='nearest')
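Note that merge_asof needs the merge key to be datetime and sorted in both frames; a minimal sketch:
A['date'] = pd.to_datetime(A['date'])
B['date'] = pd.to_datetime(B['date'])
C = pd.merge_asof(A.sort_values('date'), B.sort_values('date'),
                  on='date', direction='nearest')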

Taking Differences of Records When Status Changes - Pandas

I have customer records with id, timestamp and status.
ID, TS, STATUS
1 10 GOOD
1 20 GOOD
1 25 BAD
1 30 BAD
1 50 BAD
1 600 GOOD
2 40 GOOD
.. ...
I am trying to calculate how much time is spent in consecutive BAD statuses (let's imagine the order above is correct) per customer. So for customer id=1: (30-25) + (50-30) + (600-50) = 575 seconds were spent in BAD status.
What is the method of doing this in Pandas? If I calculate .diff() on TS, that would give me differences, but how can I tie that 1) to the customer and 2) to the individual status "blocks" for that customer?
Sample data:
df = pandas.DataFrame({'ID': [1,1,1,1,1,1,2],
                       'TS': [10,20,25,30,50,600,40],
                       'Status': ['G','G','B','B','B','G','G']},
                      columns=['ID','TS','Status'])
Thanks,
In [1]: df = DataFrame({'ID':[1,1,1,1,1,2,2],'TS':[10,20,25,30,50,10,40],
   ...:                 'Status':['G','G','B','B','B','B','B']}, columns=['ID','TS','Status'])
In [2]: f = lambda x: x.diff().sum()
In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['TS'].transform(f)
In [4]: df
Out[4]:
ID TS Status diff
0 1 10 G NaN
1 1 20 G NaN
2 1 25 B 25
3 1 30 B 25
4 1 50 B 25
5 2 10 B 30
6 2 40 B 30
Explanation:
Subset the dataframe to only those records with the desired Status. Groupby the ID and apply the lambda function diff().sum() to each group. Use transform instead of apply because transform returns an indexed series which you can use to assign to a new column 'diff'.
EDIT: New response to account for expanded question scope.
In [1]: df
Out[1]:
ID TS Status
0 1 10 G
1 1 20 G
2 1 25 B
3 1 30 B
4 1 50 B
5 1 600 G
6 2 40 G
In [2]: df['shift'] = -df['TS'].diff(-1)
In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['shift'].transform('sum')
In [4]: df
Out[4]:
ID TS Status shift diff
0 1 10 G 10 NaN
1 1 20 G 5 NaN
2 1 25 B 5 575
3 1 30 B 20 575
4 1 50 B 550 575
5 1 600 G -560 NaN
6 2 40 G NaN NaN
Here's a solution to separately aggregate each contiguous block of bad status (part 2 of your question?).
In [5]: df = pandas.DataFrame({'ID':[1,1,1,1,1,1,1,1,2,2,2],
   ...:                        'TS':[10,20,25,30,50,600,650,670,40,50,60],
   ...:                        'Status':['G','G','B','B','B','G','B','B','G','B','B']},
   ...:                       columns=['ID','TS','Status'])
In [6]: grp = df.groupby('ID')
In [7]: def status_change(df):
   ...:     return (df.Status.shift(1) != df.Status).astype(int)
   ...:
In [8]: df['BlockId'] = grp.apply(lambda df: status_change(df).cumsum())
In [9]: df['Duration'] = grp.TS.diff().shift(-1)
In [10]: df
Out[10]:
ID TS Status BlockId Duration
0 1 10 G 1 10
1 1 20 G 1 5
2 1 25 B 2 5
3 1 30 B 2 20
4 1 50 B 2 550
5 1 600 G 3 50
6 1 650 B 4 20
7 1 670 B 4 NaN
8 2 40 G 1 10
9 2 50 B 2 10
10 2 60 B 2 NaN
In [11]: df[df.Status == 'B'].groupby(['ID', 'BlockId']).Duration.sum()
Out[11]:
ID BlockId
1 2 575
4 20
2 2 10
Name: Duration
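If only the total bad time per customer is needed (not per block), summing over all bad rows works the same way; on this expanded sample it gives 595 for ID 1 (575 plus 20 from the second bad block) and 10 for ID 2:
df[df.Status == 'B'].groupby('ID').Duration.sum()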
