Merging dataframes based on date range - python

I have two pandas dataframes: one (df1) with three columns (StartDate, EndDate, and ID) and a second (df2) with a Date. I want to merge df1 and df2 based on df2.Date falling between df1.StartDate and df1.EndDate.
Each date range in df1 is unique and doesn't overlap with any of the other rows in the dataframe.
Dates are formatted YYYY-MM-DD.

Just to provide an alternative, here is a way using np.piecewise. The performance is even faster than np.searchsorted.
import pandas as pd
import numpy as np
# data
# ====================================
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'),
                    'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'),
                    'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50),
                        date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
df1.StartDate
Out[139]:
0 2010-01-01
1 2010-01-06
2 2010-01-11
3 2010-01-16
4 2010-01-21
5 2010-01-26
6 2010-01-31
7 2010-02-05
8 2010-02-10
Name: StartDate, dtype: datetime64[ns]
df2.date_time
Out[140]:
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04
4 2010-01-05
5 2010-01-06
6 2010-01-07
7 2010-01-08
8 2010-01-09
9 2010-01-10
...
40 2010-02-10
41 2010-02-11
42 2010-02-12
43 2010-02-13
44 2010-02-14
45 2010-02-15
46 2010-02-16
47 2010-02-17
48 2010-02-18
49 2010-02-19
Name: date_time, dtype: datetime64[ns]
df2['ID_matched'] = np.piecewise(np.zeros(len(df2)),
                                 [(df2.date_time.values >= start_date) & (df2.date_time.values <= end_date)
                                  for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)],
                                 df1.ID.values)
df2
Out[143]:
date_time values ID_matched
0 2010-01-01 -0.2240 1
1 2010-01-02 -0.4202 1
2 2010-01-03 0.9998 1
3 2010-01-04 0.4310 1
4 2010-01-05 -0.6509 0
5 2010-01-06 -1.4987 2
6 2010-01-07 -1.2306 2
7 2010-01-08 0.1940 2
8 2010-01-09 -0.9984 2
9 2010-01-10 -0.3676 0
.. ... ... ...
40 2010-02-10 0.5242 9
41 2010-02-11 0.3451 9
42 2010-02-12 0.7244 9
43 2010-02-13 -2.0404 9
44 2010-02-14 -1.0798 0
45 2010-02-15 -0.6934 0
46 2010-02-16 -2.3380 0
47 2010-02-17 1.6623 0
48 2010-02-18 -0.2754 0
49 2010-02-19 -0.7466 0
[50 rows x 3 columns]
%timeit df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
1000 loops, best of 3: 466 µs per loop
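For reference, since the benchmark above is against np.searchsorted but that variant isn't shown, here is a minimal sketch of how it could look (my assumption of the compared approach; it relies on df1's intervals being sorted and non-overlapping):
import numpy as np
# index of the last StartDate <= each date (-1 if the date precedes all intervals)
pos = np.searchsorted(df1.StartDate.values, df2.date_time.values, side='right') - 1
# a date matches only if it also falls on or before that interval's EndDate
in_range = (pos >= 0) & (df2.date_time.values <= df1.EndDate.values[pos.clip(0)])
df2['ID_matched'] = np.where(in_range, df1.ID.values[pos.clip(0)], 0)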

Minor correction to @JianxunLi's answer; a bit too involved for a comment.
This uses the len(funclist) == len(condlist) + 1 property of np.piecewise to assign a default value for when there is no match. Otherwise the default no-match value is zero, which can cause problems...
### Data / inits
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'),
                    'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'),
                    'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50),
                        date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
### Processing
valIfNoMatch = np.nan
df2['ID_matched'] = np.piecewise(np.zeros(len(df2)),
                                 [(df2.date_time.values >= start_date) & (df2.date_time.values < end_date)
                                  for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)],
                                 np.append(df1.ID.values, valIfNoMatch))
PS. Also corrected the typo of testing both >= and <=; a timestamp on an exact boundary between intervals would return True for two different intervals, which breaks a key assumption of the method.

conditional_join from pyjanitor may be helpful for abstraction/convenience; the function is currently in dev:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import numpy as np
import janitor
Reusing jianxun-li's data:
np.random.seed(123)
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'),
'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'),
'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50),
date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
df2.conditional_join(
df1,
('date_time', 'StartDate', '>='),
('date_time', 'EndDate', '<=')
)
left right
values date_time StartDate EndDate ID
0 -1.085631 2010-01-01 2010-01-01 2010-01-04 1
1 0.997345 2010-01-02 2010-01-01 2010-01-04 1
2 0.282978 2010-01-03 2010-01-01 2010-01-04 1
3 -1.506295 2010-01-04 2010-01-01 2010-01-04 1
4 1.651437 2010-01-06 2010-01-06 2010-01-09 2
5 -2.426679 2010-01-07 2010-01-06 2010-01-09 2
6 -0.428913 2010-01-08 2010-01-06 2010-01-09 2
7 1.265936 2010-01-09 2010-01-06 2010-01-09 2
8 -0.678886 2010-01-11 2010-01-11 2010-01-14 3
9 -0.094709 2010-01-12 2010-01-11 2010-01-14 3
10 1.491390 2010-01-13 2010-01-11 2010-01-14 3
11 -0.638902 2010-01-14 2010-01-11 2010-01-14 3
12 -0.434351 2010-01-16 2010-01-16 2010-01-19 4
13 2.205930 2010-01-17 2010-01-16 2010-01-19 4
14 2.186786 2010-01-18 2010-01-16 2010-01-19 4
15 1.004054 2010-01-19 2010-01-16 2010-01-19 4
16 0.737369 2010-01-21 2010-01-21 2010-01-24 5
17 1.490732 2010-01-22 2010-01-21 2010-01-24 5
18 -0.935834 2010-01-23 2010-01-21 2010-01-24 5
19 1.175829 2010-01-24 2010-01-21 2010-01-24 5
20 -0.637752 2010-01-26 2010-01-26 2010-01-29 6
21 0.907105 2010-01-27 2010-01-26 2010-01-29 6
22 -1.428681 2010-01-28 2010-01-26 2010-01-29 6
23 -0.140069 2010-01-29 2010-01-26 2010-01-29 6
24 -0.255619 2010-01-31 2010-01-31 2010-02-03 7
25 -2.798589 2010-02-01 2010-01-31 2010-02-03 7
26 -1.771533 2010-02-02 2010-01-31 2010-02-03 7
27 -0.699877 2010-02-03 2010-01-31 2010-02-03 7
28 -0.173636 2010-02-05 2010-02-05 2010-02-08 8
29 0.002846 2010-02-06 2010-02-05 2010-02-08 8
30 0.688223 2010-02-07 2010-02-05 2010-02-08 8
31 -0.879536 2010-02-08 2010-02-05 2010-02-08 8
32 -0.805367 2010-02-10 2010-02-10 2010-02-13 9
33 -1.727669 2010-02-11 2010-02-10 2010-02-13 9
34 -0.390900 2010-02-12 2010-02-10 2010-02-13 9
35 0.573806 2010-02-13 2010-02-10 2010-02-13 9
Under the hood it uses np.searchsorted (binary search).
Note that pd.IntervalIndex is a more efficient option, especially when the intervals are not overlapping.
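A minimal sketch of that pd.IntervalIndex option, assuming the same df1/df2 as above and strictly non-overlapping intervals (get_indexer raises on overlaps):
import numpy as np
idx = pd.IntervalIndex.from_arrays(df1.StartDate, df1.EndDate, closed='both')
# for each date, the position of the interval containing it, or -1 if none
pos = idx.get_indexer(df2.date_time)
# NaN where no interval matched (the result is float because of the NaNs)
df2['ID_matched'] = np.where(pos >= 0, df1.ID.values[pos.clip(0)], np.nan)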

Adding a year to a period?

I have a column which I have converted to datetime:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
And I need to add 1 year to this time based on a conditional:
df.loc[df['quarter'] == "Q4_", 'date'] + pd.offsets.DateOffset(years=1)
but it's not working....
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
I have tried converting it to period since I only need the year to be used in a concatenation later:
df['year'] = df['date'].dt.to_period('Y')
but I cannot add any number to a period.
This appears to be working for me:
import pandas as pd
df = pd.DataFrame({'date':pd.date_range('1/1/2021', periods=50, freq='M')})
print(df.head(24))
Input:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2021-10-31
10 2021-11-30
11 2021-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2022-10-31
22 2022-11-30
23 2022-12-31
Add a year:
df.loc[df['date'].dt.quarter == 4, 'date'] += pd.offsets.DateOffset(years=1)
print(df.head(24))
Note that, per your logic, the year increases starting in October.
Output:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2022-10-31
10 2022-11-30
11 2022-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2023-10-31
22 2023-11-30
23 2023-12-31
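As an aside on the question's last point: you actually can add an integer to a pandas Period (or to a period-dtype Series); the integer shifts by the period's own frequency. A minimal sketch:
import pandas as pd
p = pd.Timestamp('2021-10-21').to_period('Y')
print(p + 1)   # 2022
s = pd.Series(pd.to_datetime(['2021-10-21', '2021-10-24'])).dt.to_period('Y')
print(s + 1)   # both entries become 2022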

Loop through rows of dataframe using re.compile().split()

I have a dataframe that consists of 1 column and several rows. Each of these rows is constructed in the same way: -timestamp- value1 value2 value3 -timestamp- value4 value5 value6 ...
The timestamps have this format: YYYY-MM-DD HH:MM:SS and the values are numbers with 2 decimals.
I would like to make a new dataframe that has the individual timestamps in one row and the related values in the next row.
I managed to get the expected result linewise with regex but not for the entire dataframe.
My code so far:
#input dataframe
data.head()
values
0 2020-05-12 10:00:00 12.07 13 11.56 ... 2020-05-12 10:00:01 11.49 17 5.67...
1 2020-05-12 10:01:00 11.49 17 5.67 ... 2020-05-12 10:01:01 12.07 13 11.56...
2 2020-05-12 10:02:00 14.29 18 11.28 ... 2020-05-12 10:02:01 13.77 18 7.43...
test = data['values'].iloc[0] #first row of data
row1 = re.compile(r"(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(test)
df_row1 = pd.DataFrame(row1)
df_row1.head()
values
0 2020-05-12 10:00:00
1 12.07 13.79 15.45 17.17 18.91 14.91 12.35 14....
2 2020-05-12 10:00:01
3 12.48 13.96 13.88 15.57 18.46 15.0 13.65 14.6...
#trying the same for the entire dataframe
for row in data:
    df_new = re.compile(r"(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(row)
    print(df_new)
['values']
My question now is how can I loop through the rows of my dataframe and get the expected result?
In case you want to first split the lines and then extract the values into columns, note that you can use str.extract. With named groups in your regular expression, it will automatically assign the column names for your dataframe:
import pandas as pd

split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
extract_values = r"(?P<date>\d{4}-\d{2}-\d{2})\s(?P<time>\d{2}:\d{2}:\d{2})\s(?P<value_one>.*?)\s(?P<value_two>.*?)\s(?P<value_three>.*?)$"
df = pd.DataFrame([{
"value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67",
},{
"value": "2020-05-13 10:00:00 14.07 13 15.56 2020-05-16 10:00:02 11.51 18 5.69",
}])
df = df["value"].str.split(split_line).explode().str.extract(extract_values, expand=True)
print(df)
# date time value_one value_two value_three
# 0 2020-05-12 10:00:00 12.07 13 11.56
# 0 2020-06-12 11:00:00 13.07 16 11.16
# 0 2020-05-12 10:00:01 11.49 17 5.67
# 1 2020-05-13 10:00:00 14.07 13 15.56
# 1 2020-05-16 10:00:02 11.51 18 5.69
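One caveat worth noting (my addition): str.extract returns string columns, so you will likely want to convert the value columns afterwards, e.g.:
df[['value_one', 'value_two', 'value_three']] = df[['value_one', 'value_two', 'value_three']].astype(float)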
In case you do not know the number of groups after the date and time, use split rather than a regular expression. I would suggest something like this:
import pandas as pd

split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
df = pd.DataFrame([{
"value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67",
},{
"value": "2020-05-13 10:00:00 14.07 13 14 15 15.56 2020-05-16 10:00:02 11.51 18 5.69",
}])
df = df["value"].str.split(split_line).explode().reset_index()
df = df['value'].str.split(" ").apply(pd.Series)
df.columns = [f"col_{col}" for col in df.columns]
print(df)
# col_0 col_1 col_2 col_3 col_4 col_5 col_6
# 0 2020-05-12 10:00:00 12.07 13 11.56 NaN NaN
# 1 2020-06-12 11:00:00 13.07 16 11.16 NaN NaN
# 2 2020-05-12 10:00:01 11.49 17 5.67 NaN NaN
# 3 2020-05-13 10:00:00 14.07 13 14 15 15.56
# 4 2020-05-16 10:00:02 11.51 18 5.69 NaN NaN
You don't need to loop through the rows to get the result; instead, you can use Series.str.split to split the given series around the delimiter (the delimiter in this case being a regular expression), then use DataFrame.explode to transform each element of a list-like to separate rows.
Use:
data["values"] = data["values"].str.split(r'\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})')
data = data.explode("values")
data["values"] = data["values"].str.split(r'(?<=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s+')
data = data.explode("values").reset_index(drop=True)
print(data)
This resulting dataframe data should look like:
values
0 2020-05-12 10:00:00
1 12.07 13 11.56
2 2020-05-12 10:00:01
3 11.49 17 5.67
4 2020-05-12 10:01:00
5 11.49 17 5.67
6 2020-05-12 10:01:01
7 12.07 13 11.56
8 2020-05-12 10:02:00
9 14.29 18 11.28
10 2020-05-12 10:02:01
11 13.77 18 7.43

How to join a table with each group of a dataframe in pandas

I have a dataframe like the one below, where each date is the Monday of its week.
df = pd.DataFrame({'date' :['2020-04-20', '2020-05-11','2020-05-18',
'2020-04-20', '2020-04-27','2020-05-04','2020-05-18'],
'name': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
'count': [23, 44, 125, 6, 9, 10, 122]})
date name count
0 2020-04-20 A 23
1 2020-05-11 A 44
2 2020-05-18 A 125
3 2020-04-20 B 6
4 2020-04-27 B 9
5 2020-05-04 B 10
6 2020-05-18 B 122
Neither 'A' nor 'B' covers the whole date range. Both of them have some missing dates, which means the count for that week is 0. Below are all the dates:
df_dates = pd.DataFrame({ 'date':['2020-04-20', '2020-04-27','2020-05-04','2020-05-11','2020-05-18'] })
So what I need is like the dataframe below:
date name count
0 2020-04-20 A 23
1 2020-04-27 A 0
2 2020-05-04 A 0
3 2020-05-11 A 44
4 2020-05-18 A 125
5 2020-04-20 B 6
6 2020-04-27 B 9
7 2020-05-04 B 10
8 2020-05-11 B 0
9 2020-05-18 B 122
It seems like I need to join (merge) df_dates with df for each name group (A and B) and then fill the missing name and count values with 0's. Does anyone know how to achieve that? How can I join another table with a grouped table?
I tried this, with no luck:
pd.merge(df_dates, df.groupby('name'), how='left', on='date')
We can do a reindex after building a MultiIndex:
idx = pd.MultiIndex.from_product([df_dates.date, df['name'].unique()], names=['date', 'name'])
s = df.set_index(['date', 'name']).reindex(idx, fill_value=0).reset_index().sort_values('name')
s
Out[136]:
date name count
0 2020-04-20 A 23
2 2020-04-27 A 0
4 2020-05-04 A 0
6 2020-05-11 A 44
8 2020-05-18 A 125
1 2020-04-20 B 6
3 2020-04-27 B 9
5 2020-05-04 B 10
7 2020-05-11 B 0
9 2020-05-18 B 122
Or:
s = (df.pivot(index='date', columns='name', values='count')
       .reindex(df_dates.date).fillna(0).reset_index().melt('date'))
s
Out[145]:
date name value
0 2020-04-20 A 23.0
1 2020-04-27 A 0.0
2 2020-05-04 A 0.0
3 2020-05-11 A 44.0
4 2020-05-18 A 125.0
5 2020-04-20 B 6.0
6 2020-04-27 B 9.0
7 2020-05-04 B 10.0
8 2020-05-11 B 0.0
9 2020-05-18 B 122.0
If you are just looking to fill in the union of dates already present in df, you can do:
(df.set_index(['date','name'])
.unstack('date',fill_value=0)
.stack().reset_index()
)
Output:
name date count
0 A 2020-04-20 23
1 A 2020-04-27 0
2 A 2020-05-04 0
3 A 2020-05-11 44
4 A 2020-05-18 125
5 B 2020-04-20 6
6 B 2020-04-27 9
7 B 2020-05-04 10
8 B 2020-05-11 0
9 B 2020-05-18 122
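Since the question literally asks about merging, a merge-based sketch also works (my addition; how='cross' needs pandas >= 1.2): cross-join the dates with the distinct names, then left-merge the observed counts and fill the gaps with 0.
# cross-join every date with every distinct name
full = df_dates.merge(df[['name']].drop_duplicates(), how='cross')
# left-merge the observed counts, fill gaps with 0, restore int dtype
out = (full.merge(df, on=['date', 'name'], how='left')
           .fillna({'count': 0})
           .astype({'count': int})
           .sort_values(['name', 'date'], ignore_index=True))
print(out)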

How to copy paste values from another dataset conditional on a column

I have df1
Id Data Group_Id
0 1 A 1
1 2 B 2
2 3 B 3
...
100 4 A 101
101 5 A 102
...
and df2
Timestamp Group_Id
2012-01-01 00:00:05.523 1
2013-07-01 00:00:10.757 2
2014-01-12 00:00:15.507. 3
...
2016-03-05 00:00:05.743 101
2017-12-24 00:00:10.407 102
...
I want to match the 2 datasets by Group_Id, then copy only the date part of Timestamp in df2 into a new column in df1, named day1, based on the corresponding Group_Id.
Then I want to add 6 more columns next to day1, named day2, ..., day7, holding the next six days after day1. So it looks like:
Id Data Group_Id day1 day2 day3 ... day7
0 1 A 1 2012-01-01 2012-01-02 2012-01-03 ...
1 2 B 2 2013-07-01 2013-07-02 2013-07-03 ...
2 3 B 3 2014-01-12 2014-01-13 2014-01-14 ...
...
100 4 A 101 2016-03-05 2016-03-06 2016-03-07 ...
101 5 A 102 2017-12-24 2017-12-25 2017-12-26 ...
...
Thanks.
First we need to merge here, then build the seven day columns from the merged Timestamp (note periods=7 so day1 through day7 are all produced, and normalize() drops the time-of-day):
df1 = df1.merge(df2, how='left')
s = pd.DataFrame([pd.date_range(x, periods=7, freq='D') for x in df1.Timestamp.dt.normalize()],
                 index=df1.index)
s.columns += 1
df1.join(s.add_prefix('day'))
Another approach here basically just merges the dfs, grabs the date from the timestamp, and makes 6 new columns, adding a day each time:
import pandas as pd
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df3 = df1.merge(df2, on='Group_Id')
df3['Timestamp'] = pd.to_datetime(df3['Timestamp']) #only necessary if not already timestamp
df3['day1'] = df3['Timestamp'].dt.date
for i in range(1, 7):
    df3['day' + str(i+1)] = df3['day1'] + pd.Timedelta(i, unit='d')
output:
Id Data Group_Id Timestamp day1 day2 day3 day4 day5 day6 day7
0 1 A 1 2012-01-01 00:00:05.523 2012-01-01 2012-01-02 2012-01-03 2012-01-04 2012-01-05 2012-01-06 2012-01-07
1 2 B 2 2013-07-01 00:00:10.757 2013-07-01 2013-07-02 2013-07-03 2013-07-04 2013-07-05 2013-07-06 2013-07-07
2 3 B 3 2014-01-12 00:00:15.507 2014-01-12 2014-01-13 2014-01-14 2014-01-15 2014-01-16 2014-01-17 2014-01-18
3 4 A 101 2016-03-05 00:00:05.743 2016-03-05 2016-03-06 2016-03-07 2016-03-08 2016-03-09 2016-03-10 2016-03-11
4 5 A 102 2017-12-24 00:00:10.407 2017-12-24 2017-12-25 2017-12-26 2017-12-27 2017-12-28 2017-12-29 2017-12-30
note that I copied your data frame into a csv and only had the 5 entries, so the index is not the same as in your example (i.e. 100, 101)
you can delete the Timestamp col if not needed
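A one-liner for that (my addition, using the standard drop API):
df3 = df3.drop(columns='Timestamp')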

Aggregate 15-minute timestamps to hourly and find sum, avg and max for multiple columns in pandas

I have a dataframe with period_start_time at every 15 minutes, and I need to aggregate to 1 hour and calculate sum, avg and max for almost every column in the dataframe (it has about 20 columns):
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
And the same for val2 and every other column in the dataframe.
I have no idea how to group by period start time per hour (rather than for the whole day), or how to start.
I believe you need Series.dt.floor to floor the timestamps to hours and then aggregate with agg:
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'),'ID']).agg(['mean','sum', 'max'])
#for columns from MultiIndex
df.columns = df.columns.map('_'.join)
print (df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print (df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly, you can convert PERIOD_START_TIME to a pandas Period:
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()
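For completeness, an equivalent spelling (my addition, not from the original answers) uses pd.Grouper, which reads like a resample while still allowing a simultaneous groupby on ID; it assumes PERIOD_START_TIME is already datetime64. The same column-flattening trick from the first answer applies afterwards:
df.groupby([pd.Grouper(key='PERIOD_START_TIME', freq='H'), 'ID']).agg(['mean', 'sum', 'max'])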
