How to search for time series data across two data frames - python

I have two pandas data frames df1 and df2 like the following:
df1 containing all the data of type a in increasing time order:
type Date
0 a 1970-01-01
1 a 2008-08-01
2 a 2009-07-24
3 a 2010-09-30
4 a 2011-09-29
5 a 2013-06-11
6 a 2013-12-17
7 a 2015-06-02
8 a 2016-06-14
9 a 2017-06-21
10 a 2018-11-26
11 a 2019-06-03
12 a 2019-12-16
df2 containing all the data of type b in increasing time order:
type Date
0 b 2017-11-29
1 b 2018-05-30
2 b 2018-11-26
3 b 2019-06-03
4 b 2019-12-16
5 b 2020-06-18
6 b 2020-12-17
7 b 2021-06-28
A type a entry and a type b entry match if the date difference between them is within one year. One type a entry can match at most one type b entry, and vice versa. How can I efficiently find the maximum number of matching pairs, listed in increasing time order like the following?
type1 Date1 type2 Date2
0 a 2017-06-21 b 2017-11-29
1 a 2018-11-26 b 2018-05-30
2 a 2019-06-03 b 2018-11-26
3 a 2019-12-16 b 2019-06-03

Use merge_asof:
df3 = pd.merge_asof(df1.rename(columns={'Date':'Date1', 'type':'type1'}),
                    df2.rename(columns={'Date':'Date2', 'type':'type2'}),
                    left_on='Date1',
                    right_on='Date2',
                    direction='nearest',
                    allow_exact_matches=False,
                    tolerance=pd.Timedelta('365 days')).dropna(subset=['Date2'])
print (df3)
   type1      Date1 type2      Date2
9      a 2017-06-21     b 2017-11-29
10     a 2018-11-26     b 2018-05-30
11     a 2019-06-03     b 2018-11-26
12     a 2019-12-16     b 2020-06-18
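Note that merge_asof does not enforce one-to-one matching, so the same df2 row can be matched to several df1 rows, and the last pair above differs from the desired output. A greedy two-pointer sketch over the two sorted frames (shown here on the tail of the sample data, and assuming both frames are sorted by date as in the question) does enforce one-to-one pairing:

```python
import pandas as pd

df1 = pd.DataFrame({'type': 'a', 'Date': pd.to_datetime(
    ['2016-06-14', '2017-06-21', '2018-11-26', '2019-06-03', '2019-12-16'])})
df2 = pd.DataFrame({'type': 'b', 'Date': pd.to_datetime(
    ['2017-11-29', '2018-05-30', '2018-11-26', '2019-06-03', '2019-12-16'])})

tol = pd.Timedelta('365 days')
pairs, i, j = [], 0, 0
# Walk both sorted frames once: pair the two current rows when they are
# within tolerance, otherwise advance the side with the earlier date.
while i < len(df1) and j < len(df2):
    d1, d2 = df1['Date'].iloc[i], df2['Date'].iloc[j]
    if abs(d1 - d2) <= tol:
        pairs.append((d1, d2))
        i += 1
        j += 1
    elif d1 < d2:
        i += 1
    else:
        j += 1

res = pd.DataFrame(pairs, columns=['Date1', 'Date2'])
```

On this data the greedy pass reproduces the four desired pairs in O(len(df1) + len(df2)) time.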


How to calculate monthly changes in a time series using pandas dataframe

As I am new to Python, this is probably basic for most of you. I have a df where 'Date' is the index, one column holds the month of each date, and one column holds the data.
Mnth TSData
Date
2012-01-05 1 192.6257
2012-01-12 1 194.2714
2012-01-19 1 192.0086
2012-01-26 1 186.9729
2012-02-02 2 183.7700
2012-02-09 2 178.2343
2012-02-16 2 172.3429
2012-02-23 2 171.7800
2012-03-01 3 169.6300
2012-03-08 3 168.7386
2012-03-15 3 167.1700
2012-03-22 3 165.9543
2012-03-29 3 165.0771
2012-04-05 4 164.6371
2012-04-12 4 164.6500
2012-04-19 4 166.9171
2012-04-26 4 166.4514
2012-05-03 5 166.3657
2012-05-10 5 168.2543
2012-05-17 5 176.8271
2012-05-24 5 179.1971
2012-05-31 5 183.7120
2012-06-07 6 195.1286
I wish to calculate monthly changes in the data set that I can later use in a boxplot. So from the table above the results I seek are:
Mnth Chng
1 -8.86 (183.77 - 192.63)
2 -14.14 (169.63 - 183.77)
3 -4.99 (164.63 - 169.63)
4 1.73 (166.37 - 164.63)
5 28.77 (195.13 - 166.37)
and so on...
any suggestions?
thanks :)
IIUC, starting from this as df:
Date Mnth TSData
0 2012-01-05 1 192.6257
1 2012-01-12 1 194.2714
2 2012-01-19 1 192.0086
3 2012-01-26 1 186.9729
4 2012-02-02 2 183.7700
...
20 2012-05-24 5 179.1971
21 2012-05-31 5 183.7120
22 2012-06-07 6 195.1286
you can use:
df.groupby('Mnth')['TSData'].first().diff().shift(-1)
# or
# -df.groupby('Mnth')['TSData'].first().diff(-1)
NB: the data must be sorted by date so that the desired row is used as the first item of each group (df.sort_values(by=['Mnth', 'Date']))
output:
Mnth
1 -8.8557
2 -14.1400
3 -4.9929
4 1.7286
5 28.7629
6 NaN
Name: TSData, dtype: float64
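As a self-contained sketch of this approach, using only the first three months of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2012-01-05', '2012-01-26', '2012-02-02',
                            '2012-02-23', '2012-03-01']),
    'Mnth': [1, 1, 2, 2, 3],
    'TSData': [192.6257, 186.9729, 183.77, 171.78, 169.63],
})

# First value of each month, month-over-month change,
# shifted so the change is aligned to the earlier month.
df = df.sort_values(['Mnth', 'Date'])
chng = df.groupby('Mnth')['TSData'].first().diff().shift(-1)
```

The result is a Series indexed by Mnth, with -8.8557 for month 1, -14.14 for month 2, and NaN for the last month (no following month to diff against).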
First, make sure we have a datetime index:
df.index = pd.to_datetime(df.index)
Then it's simply a matter of using resample:
df['TSData'].resample('M').first().diff().shift(freq='-1M')
Output:
Date
2011-12-31 NaN
2012-01-31 -8.8557
2012-02-29 -14.1400
2012-03-31 -4.9929
2012-04-30 1.7286
2012-05-31 28.7629
Name: TSData, dtype: float64

merge to replace nans of the same column in pandas dataframe?

I have the following dataframe to which I want to merge multiple dataframes; this df consists of ID, date, and many other variables.
ID date ..other variables...
A 2017Q1
A 2017Q2
A 2018Q1
B 2017Q1
B 2017Q2
B 2017Q3
C 2018Q1
C 2018Q2
.. ..
And I have a bunch of dataframes (one per quarter) that hold asset holdings information.
df_2017Q1:
ID date asset_holdings
A 2017Q1 1
B 2017Q1 2
C 2017Q1 4
...
df_2017Q2
ID date asset_holdings
A 2017Q2 2
B 2017Q2 5
C 2017Q2 4
...
df_2017Q3
ID date asset_holdings
A 2017Q3 1
B 2017Q3 2
C 2017Q3 10
...
df_2017Q4..
ID date asset_holdings
A 2017Q4 10
B 2017Q4 20
C 2017Q4 14
...
df_2018Q1..
ID date asset_holdings
A 2018Q1 11
B 2018Q1 23
C 2018Q1 15
...
df_2018Q2...
ID date asset_holdings
A 2018Q2 11
B 2018Q2 26
C 2018Q2 19
...
....
desired output
ID date asset_holdings ..other variables...
A 2017Q1 1
A 2017Q2 2
A 2018Q1 11
B 2017Q1 2
B 2017Q2 5
B 2017Q3 2
C 2018Q1 15
C 2018Q2 19
.. ..
I think merging on ID and date should do it, but this will create n extra columns which I do not want, so I want to create a single column "asset_holdings" and merge the right dfs while updating NaN values. But I'm not sure this is the smartest way. Any help will be appreciated!
Try pd.concat() to concatenate your different DataFrames, then sort_values(['ID', 'date']) to sort the values by the ID and date columns.
See the example below as demonstration.
import pandas as pd
df1 = pd.DataFrame({'ID':list('ABCD'), 'date':['2017Q1']*4, 'other':[1,2,3,4]})
df2 = pd.DataFrame({'ID':list('ABCD'), 'date':['2017Q2']*4, 'other':[4,3,2,1]})
df3 = pd.DataFrame({'ID':list('ABCD'), 'date':['2018Q1']*4, 'other':[7,6,5,4]})
ans = pd.concat([df1, df2, df3]).sort_values(['ID', 'date'], ignore_index=True)
>>> ans
ID date other
0 A 2017Q1 1
1 A 2017Q2 4
2 A 2018Q1 7
3 B 2017Q1 2
4 B 2017Q2 3
5 B 2018Q1 6
6 C 2017Q1 3
7 C 2017Q2 2
8 C 2018Q1 5
9 D 2017Q1 4
10 D 2017Q2 1
11 D 2018Q1 4
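If the base df also carries other variables that must survive, a sketch (with hypothetical toy frames) is to concat the quarterly frames first and then left-merge them onto the base frame on ID and date, which fills a single asset_holdings column without duplicating columns:

```python
import pandas as pd

# hypothetical base frame with "other variables" and quarterly frames
base = pd.DataFrame({'ID': ['A', 'A', 'B'],
                     'date': ['2017Q1', '2017Q2', '2017Q1'],
                     'other': [10, 20, 30]})
q1 = pd.DataFrame({'ID': ['A', 'B'], 'date': ['2017Q1'] * 2,
                   'asset_holdings': [1, 2]})
q2 = pd.DataFrame({'ID': ['A', 'B'], 'date': ['2017Q2'] * 2,
                   'asset_holdings': [2, 5]})

# stack all quarters into one long frame, then attach to the base rows
holdings = pd.concat([q1, q2], ignore_index=True)
out = base.merge(holdings, on=['ID', 'date'], how='left')
```

Rows of the base frame with no quarterly match would simply get NaN in asset_holdings.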

Pandas - Times series multiple slices of a dataframe groupby Id

What I have:
A dataframe df consisting of 3 columns (Id, Item and Timestamp). Each subject has a unique Id, with Items recorded at particular dates and times (Timestamp). The second dataframe, df_ref, holds the date-time range references for slicing df: a Start and an End for each Id.
df:
Id Item Timestamp
0 1 aaa 2011-03-15 14:21:00
1 1 raa 2012-05-03 04:34:01
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
5 1 aud 2017-05-10 11:58:02
6 2 boo 2004-06-22 22:20:58
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
9 2 baa 2008-05-22 21:28:00
10 2 boo 2017-06-08 23:31:06
11 3 ige 2011-06-30 13:14:21
12 3 afr 2013-06-11 01:38:48
13 3 gui 2013-06-21 23:14:26
14 3 loo 2014-06-10 15:15:42
15 3 boo 2015-01-23 02:08:35
16 3 afr 2015-04-15 00:15:23
17 3 aaa 2016-02-16 10:26:03
18 3 aaa 2016-06-10 01:11:15
19 3 ige 2016-07-18 11:41:18
20 3 boo 2016-12-06 19:14:00
21 4 gui 2016-01-05 09:19:50
22 4 aaa 2016-12-09 14:49:50
23 4 ige 2016-11-01 08:23:18
df_ref:
Id Start End
0 1 2013-03-12 00:00:00 2016-05-30 15:20:36
1 2 2005-06-05 08:51:22 2007-02-24 00:00:00
2 3 2011-05-14 10:11:28 2013-12-31 17:04:55
3 3 2015-03-29 12:18:31 2016-07-26 00:00:00
What I want:
Slice the df dataframe based on the date-time range(s) given for each Id in df_ref (groupby Id) and concatenate the slices into a new dataframe. Note that a subject can have more than one date-time range (in this example Id=3 has two).
df_expected:
Id Item Timestamp
0 1 baa 2013-05-08 22:21:29
1 1 boo 2015-12-24 21:53:41
2 1 afr 2016-04-14 12:28:26
3 2 aaa 2005-11-16 07:00:00
4 2 ige 2006-06-28 17:09:18
5 3 ige 2011-06-30 13:14:21
6 3 afr 2013-06-11 01:38:48
7 3 gui 2013-06-21 23:14:26
8 3 afr 2015-04-15 00:15:23
9 3 aaa 2016-02-16 10:26:03
10 3 aaa 2016-06-10 01:11:15
11 3 ige 2016-07-18 11:41:18
What I have done so far:
I referred to this post (Time series multiple slice) while writing my code, and modified it since it lacks the groupby element which I need.
My code:
from datetime import datetime
df['Timestamp'] = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M')
x = pd.DataFrame()
for pid in def_ref.Id.unique():
    selection = df[(df['Id'] == pid) & (df['Timestamp'] >= def_ref['Start']) & (df['Timestamp'] <= def_ref['End'])]
    x = x.append(selection)
Above code give error:
ValueError: Can only compare identically-labeled Series objects
First use merge with the default inner join; it also creates all combinations for duplicated Id. Then filter with Series.between and use DataFrame.loc to apply the condition and select df.columns in one step:
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print (df2)
Id Item Timestamp
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
11 3 ige 2011-06-30 13:14:21
13 3 afr 2013-06-11 01:38:48
15 3 gui 2013-06-21 23:14:26
22 3 afr 2015-04-15 00:15:23
24 3 aaa 2016-02-16 10:26:03
26 3 aaa 2016-06-10 01:11:15
28 3 ige 2016-07-18 11:41:18
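A self-contained sketch of the merge-then-between pattern on toy frames (an Id with two ranges, mirroring Id=3 in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [3, 3, 3],
    'Item': ['ige', 'loo', 'aaa'],
    'Timestamp': pd.to_datetime(['2011-06-30', '2014-06-10', '2016-02-16']),
})
df_ref = pd.DataFrame({
    'Id': [3, 3],
    'Start': pd.to_datetime(['2011-05-14', '2015-03-29']),
    'End': pd.to_datetime(['2013-12-31', '2016-07-26']),
})

# merge duplicates each df row once per matching Id in df_ref,
# then between() keeps only the rows falling inside one of the ranges
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
```

Here 'loo' (2014-06-10) falls in neither range and is dropped, while the other two rows each survive exactly once because the ranges do not overlap.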

How to calculate time difference by group using pandas?

Problem
I want to calculate diff by group, and I don't know how to sort the time column so that each group's results are sorted and positive.
The original data :
In [37]: df
Out[37]:
id time
0 A 2016-11-25 16:32:17
1 A 2016-11-25 16:36:04
2 A 2016-11-25 16:35:29
3 B 2016-11-25 16:35:24
4 B 2016-11-25 16:35:46
The result I want
Out[40]:
id time
0 A 00:35
1 A 03:12
2 B 00:22
notice: the type of time col is timedelta64[ns]
Trying
In [38]: df['time'].diff(1)
Out[38]:
0 NaT
1 00:03:47
2 -1 days +23:59:25
3 -1 days +23:59:55
4 00:00:22
Name: time, dtype: timedelta64[ns]
This doesn't give the desired result.
Hope
Not only to solve the problem, but for the code to run fast, because there are 50 million rows.
You can use sort_values with groupby and aggregating diff:
df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
id time diff
0 A 2016-11-25 16:32:17 NaT
1 A 2016-11-25 16:36:04 00:00:35
2 A 2016-11-25 16:35:29 00:03:12
3 B 2016-11-25 16:35:24 NaT
4 B 2016-11-25 16:35:46 00:00:22
If you need to remove rows with NaT in the diff column, use dropna:
df = df.dropna(subset=['diff'])
print (df)
id time diff
2 A 2016-11-25 16:35:29 00:03:12
1 A 2016-11-25 16:36:04 00:00:35
4 B 2016-11-25 16:35:46 00:00:22
You can also overwrite column:
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
id time
0 A NaT
1 A 00:00:35
2 A 00:03:12
3 B NaT
4 B 00:00:22
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
df = df.dropna(subset=['time'])
print (df)
id time
1 A 00:00:35
2 A 00:03:12
4 B 00:00:22
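Putting the sort, group-wise diff, and dropna together as a runnable sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'A', 'A', 'B', 'B'],
    'time': pd.to_datetime(['2016-11-25 16:32:17', '2016-11-25 16:36:04',
                            '2016-11-25 16:35:29', '2016-11-25 16:35:24',
                            '2016-11-25 16:35:46']),
})

# sort within each id so every diff is positive, then diff per group;
# assignment aligns on the original index, so row order is preserved
df['diff'] = df.sort_values(['id', 'time']).groupby('id')['time'].diff()
out = df.dropna(subset=['diff'])
```

Both sort_values and the groupby diff are vectorized, so this pattern also scales to large frames like the 50-million-row case mentioned above.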

Pandas merge column where between dates

I have two dataframes - one of calls made to customers and another identifying active service durations by client. Each client can have multiple services, but they will not overlap.
df_calls = pd.DataFrame([['A','2016-02-03',1],['A','2016-05-11',2],['A','2016-10-01',3],['A','2016-11-02',4],
['B','2016-01-10',5],['B','2016-04-25',6]], columns = ['cust_id','call_date','call_id'])
print df_calls
cust_id call_date call_id
0 A 2016-02-03 1
1 A 2016-05-11 2
2 A 2016-10-01 3
3 A 2016-11-02 4
4 B 2016-01-10 5
5 B 2016-04-25 6
and
df_active = pd.DataFrame([['A','2016-01-10','2016-03-15',1],['A','2016-09-10','2016-11-15',2],
['B','2016-01-02','2016-03-17',3]], columns = ['cust_id','service_start','service_end','service_id'])
print df_active
cust_id service_start service_end service_id
0 A 2016-01-10 2016-03-15 1
1 A 2016-09-10 2016-11-15 2
2 B 2016-01-02 2016-03-17 3
I need to find the service_id each call belongs to, identified by the service_start and service_end dates. Calls that do not fall between any dates should remain in the dataset.
Here's what I tried so far:
df_test_output = pd.merge(df_calls,df_active, how = 'left',on = ['cust_id'])
df_test_output = df_test_output[(df_test_output['call_date']>= df_test_output['service_start'])
& (df_test_output['call_date']<= df_test_output['service_end'])].drop(['service_start','service_end'],axis = 1)
print df_test_output
cust_id call_date call_id service_id
0 A 2016-02-03 1 1
5 A 2016-10-01 3 2
7 A 2016-11-02 4 2
8 B 2016-01-10 5 3
This drops all the calls that were not between service dates. Any thoughts on how I can merge on the service_id where it meets the criteria, but retain the remaining records?
The result should look like this:
#do black magic
print df_calls
cust_id call_date call_id service_id
0 A 2016-02-03 1 1.0
1 A 2016-05-11 2 NaN
2 A 2016-10-01 3 2.0
3 A 2016-11-02 4 2.0
4 B 2016-01-10 5 3.0
5 B 2016-04-25 6 NaN
You can use merge with a left join of df_calls against the filtered frame df_test_output from your attempt:
print (pd.merge(df_calls, df_test_output, how='left'))
cust_id call_date call_id service_id
0 A 2016-02-03 1 1.0
1 A 2016-05-11 2 NaN
2 A 2016-10-01 3 2.0
3 A 2016-11-02 4 2.0
4 B 2016-01-10 5 3.0
5 B 2016-04-25 6 NaN
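Putting the question's filter and the final left join together as a runnable sketch:

```python
import pandas as pd

df_calls = pd.DataFrame([['A', '2016-02-03', 1], ['A', '2016-05-11', 2],
                         ['A', '2016-10-01', 3], ['A', '2016-11-02', 4],
                         ['B', '2016-01-10', 5], ['B', '2016-04-25', 6]],
                        columns=['cust_id', 'call_date', 'call_id'])
df_active = pd.DataFrame([['A', '2016-01-10', '2016-03-15', 1],
                          ['A', '2016-09-10', '2016-11-15', 2],
                          ['B', '2016-01-02', '2016-03-17', 3]],
                         columns=['cust_id', 'service_start', 'service_end', 'service_id'])

# expand per customer; the ISO-formatted date strings compare correctly
m = pd.merge(df_calls, df_active, how='left', on=['cust_id'])
m = m[(m['call_date'] >= m['service_start'])
      & (m['call_date'] <= m['service_end'])].drop(['service_start', 'service_end'], axis=1)

# left-merge the matches back so unmatched calls survive with NaN service_id
result = pd.merge(df_calls, m, how='left')
```

The second merge joins on all shared columns (cust_id, call_date, call_id), so every original call row is kept and only the matched ones receive a service_id.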
