Get row with closest date in other dataframe pandas - python

I have two dataframes.
Dataframe1:
id date1
1 11-04-2022
1 03-02-2011
2 03-05-2222
3 01-01-2001
4 02-02-2012
and Dataframe2:
id date2 data data2
1 11-02-2222 1 3
1 11-02-1999 3 4
1 11-03-2022 4 5
2 22-03-4444 5 6
2 22-02-2020 7 8
...
What I would like to do is take, for each row of Dataframe1, the row from Dataframe2 with the closest date2 to date1. The match has to have the same id, and date2 has to be before date1.
The desired output would look like this:
id date1 date2 data data2
1 11-04-2022 11-03-2022 4 5
1 03-02-2011 11-02-1999 3 4
2 03-05-2222 22-02-2020 7 8
How would I do this using pandas?

Try pd.merge_asof, but first convert date1 and date2 to datetime and sort both dataframes. Since date2 has to fall on or before date1, use direction="backward" rather than "nearest":
df1["date1"] = pd.to_datetime(df1["date1"])
df2["date2"] = pd.to_datetime(df2["date2"])
df1 = df1.sort_values(by="date1")
df2 = df2.sort_values(by="date2")
print(
    pd.merge_asof(
        df1,
        df2,
        by="id",
        left_on="date1",
        right_on="date2",
        direction="backward",  # closest date2 at or before date1
    ).dropna(subset=["date2"])  # drop ids with no earlier date2
)
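If the match should not reach arbitrarily far back in time, merge_asof also accepts a tolerance parameter; a minimal sketch, assuming a one-year cap is wanted:
# hypothetical variant: ignore candidates more than ~1 year before date1
pd.merge_asof(
    df1, df2, by="id", left_on="date1", right_on="date2",
    direction="backward", tolerance=pd.Timedelta(days=365),
)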

Related

Python Pandas - Selecting specific rows based on the max and min of two columns with the same group id

I am looking for a way to identify the 'master' row: for each group_id, the row with the minimum cust_hierarchy; if there is a tie, the row with the most recent date.
I have supplied some sample tables below:
row_id  group_id  cust_hierarchy  most_recent_date  master (what I am looking for)
1       0         2               2020-01-03        1
2       0         7               2019-01-01        0
3       1         7               2019-05-01        0
4       1         6               2019-04-01        0
5       1         6               2019-04-03        1
I was thinking of possibly ordering by the two columns (cust_hierarchy ascending, most_recent_date descending) and then adding a new column that places a 1 on the first row for each group_id.
Does anyone have any helpful code for this?
You can basically do a groupby with idxmin(), with a little bit of sorting to ensure the most recent date is selected by the min operation:
import pandas as pd
import numpy as np

# example data
dates = ['2020-01-03', '2019-01-01', '2019-05-01',
         '2019-04-01', '2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id': [0, 0, 1, 1, 1],
                   'cust_hierarchy': [2, 7, 7, 6, 6],
                   'most_recent_date': dates})

# solution: sort descending by date so idxmin picks the most recent row among ties
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
df before:
group_id cust_hierarchy most_recent_date
0 0 2 2020-01-03
1 0 7 2019-01-01
2 1 7 2019-05-01
3 1 6 2019-04-01
4 1 6 2019-04-03
df after:
group_id cust_hierarchy most_recent_date master
0 0 2 2020-01-03 True
1 0 7 2019-01-01 False
2 1 7 2019-05-01 False
3 1 6 2019-04-01 False
4 1 6 2019-04-03 True
Use duplicated on sort_values:
df['master'] = 1 - (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                   ascending=[False, True])
                      .duplicated('group_id', keep='last')
                      .astype(int))
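Closer to the ordering idea in the question, you could also sort and then flag the first row of each group with cumcount; a sketch on the same example data:
# sort so the desired master row comes first in each group,
# then mark the first row of every group_id with 1
order = df.sort_values(['cust_hierarchy', 'most_recent_date'],
                       ascending=[True, False])
df['master'] = order.groupby('group_id').cumcount().eq(0).astype(int)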

Create Pandas TimeSeries from Data, Period-range and aggregation function

Context
I'd like to create a time series (with pandas) that counts, for each date, the distinct id values whose start and end dates enclose the considered date.
For the sake of legibility, this is a simplified version of the problem.
Data
Let's define the Data this way:
df = pd.DataFrame({
    'customerId': ['1', '1', '1', '2', '2'],
    'id': ['1', '2', '3', '1', '2'],
    'startDate': ['2000-01', '2000-01', '2000-04', '2000-05', '2000-06'],
    'endDate': ['2000-08', '2000-02', '2000-07', '2000-07', '2000-08'],
})
And the period range this way:
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
Objectives
For each customerId, there are several distinct id.
The final aim is to get, for each date of the period range and for each customerId, the count of distinct id whose startDate and endDate satisfy the function my_date_predicate.
Simplified definition of my_date_predicate:
unset_date = pd.to_datetime("1900-01")

def my_date_predicate(date, row):
    return row.startDate <= date and \
           (row.endDate == unset_date or row.endDate > date)
Expected result
I'd like a time series result like this:
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 0
Question
How could I use pandas to get such a result?
Here's a solution:
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
# one month-start stamp per active month in [startDate, endDate)
# (on pandas < 1.4 use closed="left" instead of inclusive="left")
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"],
                                                 freq="MS", inclusive="left"), axis=1)
df = df.explode("month")
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
# cross join months x customers so empty months still appear in the result
customers_df = pd.DataFrame(df.customerId.unique(), columns=["customerId"])
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on="dummy").drop("dummy", axis=1)
t = pd.merge(t, df, on=["customerId", "month"], how="left")
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})
The result is:
count
month customerId
2000-01-01 1 2
2 0
2000-02-01 1 1
2 0
2000-03-01 1 1
2 0
2000-04-01 1 2
2 0
2000-05-01 1 2
2 1
2000-06-01 1 2
2 2
2000-07-01 1 1
2 1
Note:
For unset dates, replace the end date with the very last date you're interested in before you start the calculation.
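A minimal sketch of that replacement, assuming unset end dates carry the unset_date sentinel from the question:
# replace the sentinel with the last period of interest before building "month"
df.loc[df.endDate == unset_date, "endDate"] = pd.to_datetime("2000-07")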
You can do it with two pivot_table calls to get the count of id per customer (in columns) by start date and by end date (in the index). Reindex each one with the period_range you are interested in. Subtract the end pivot from the start pivot, then use cumsum to get the cumulative count of active id per customerId. Finally, use stack and reset_index to bring it to the wanted shape.
#convert the date columns to monthly periods matching period_range
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')
#create the pivots
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id',
aggfunc='count', fill_value=0)
.reindex(period_range, fill_value=0)
)
pve = (df.pivot_table(index='endDate', columns='customerId', values='id',
aggfunc='count', fill_value=0)
.reindex(period_range, fill_value=0)
)
print (pvs)
customerId 1 2
2000-01 2 0 #two id for customer 1 that start at this month
2000-02 0 0
2000-03 0 0
2000-04 1 0
2000-05 0 1 #one id for customer 2 that start at this month
2000-06 0 1
2000-07 0 0
Now you can subtract one from the other and use cumsum to get the wanted amount per date.
res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 1
Note: I'm not really sure how to handle the unset_date, as I don't see what it is used for.

Pandas Group by date weekly

I'm using pandas to do some calculations, and I need to sum values like count1 and count2 based on dates, weekly or daily and so on.
my df =
id count1 ... count2 date
0 1 1 ... 52 2019-12-09
1 2 1 ... 23 2019-12-10
2 3 1 ... 0 2019-12-11
3 4 1 ... 17 2019-12-18
4 5 1 ... 20 2019-12-20
5 6 1 ... 4 2019-12-21
6 7 1 ... 2 2019-12-21
How can I group by date with weekly frequency?
I tried many ways but got different errors.
Many thanks.
Use DataFrame.resample by W with sum:
#convert date column to datetimes
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W', on='date')[['count1', 'count2']].sum()
Or use Grouper:
df1 = df.groupby(pd.Grouper(freq='W', key='date'))[['count1', 'count2']].sum()
print (df1)
count1 count2
date
2019-12-15 3 75
2019-12-22 4 43
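Note that 'W' anchors weeks to end on Sunday ('W-SUN'), which is why the index shows 2019-12-15 and 2019-12-22; a sketch of a differently anchored variant:
# weeks ending on Monday instead of the default Sunday
df1 = df.groupby(pd.Grouper(freq='W-MON', key='date'))[['count1', 'count2']].sum()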
First, generate a week column so you can group by it:
df['week_id'] = df['date'].dt.isocalendar().week  # df['date'].dt.week on pandas < 1.1
Then group the dataframe and iterate through each group to do your stuff:
grouped_df = df.groupby('week_id')
for week_id, sub_df in grouped_df:
    # use sub_df, it holds the data for one week

How to drop records based on number of unique days using pandas?

I have a dataframe like as shown below
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-05 12:59:00',
               '2173-05-04 13:14:00', '2173-05-05 13:37:00', '2173-07-03 13:39:00',
               '2173-07-04 11:30:00', '2173-04-04 16:00:00', '2173-04-09 22:00:00',
               '2173-04-11 04:00:00', '2173-04-13 04:30:00', '2173-04-14 08:00:00'],
    'val': [5, 5, 5, 5, 1, 6, 5, 5, 8, 3, 4, 6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What I would like to do is drop records/subjects that don't have 4 or more unique days.
If you look at my sample dataframe, subject_id = 1 has only 3 unique day values (3, 4 and 5), so I would like to drop subject_id = 1 completely. Subject_id = 2, on the other hand, has 5 unique day values (4, 9, 11, 13, 14), so it is kept. Please note that the date values have timestamps, hence I extract the day from each datetime field and check for unique records.
This is what I tried
df.groupby(['subject_id','day']).transform('size')>4 # doesn't work
df[df.groupby(['subject_id','day'])['subject_id'].transform('size')>=4] # doesn't produce expected output
I expect my output to be like this
Change your function from size to DataFrameGroupBy.nunique, grouping only by the subject_id column:
df = df[df.groupby('subject_id')['day'].transform('nunique')>=4]
Alternatively, you can use GroupBy.filter, but this should be slower on a larger dataframe or with many unique groups:
df = df.groupby('subject_id').filter(lambda x: x['day'].nunique()>=4)
print (df)
subject_id time_1 val day month
7 2 2173-04-04 16:00:00 5 4 4
8 2 2173-04-09 22:00:00 8 9 4
9 2 2173-04-11 04:00:00 3 11 4
10 2 2173-04-13 04:30:00 4 13 4
11 2 2173-04-14 08:00:00 6 14 4
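Note that grouping on dt.day counts day-of-month numbers, exactly as asked (April 3 and July 3 are the same "day"); if distinct calendar dates were wanted instead, a sketch (this would keep subject 1 in the sample above, since it spans 6 distinct dates):
# assumption: count distinct calendar dates rather than day-of-month numbers
df = df[df.groupby('subject_id')['time_1']
          .transform(lambda s: s.dt.normalize().nunique()) >= 4]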

Pandas convert string to end of month date

I have this problem where one of the columns in my df is entered as a string, but I want to convert it into an end-of-month date in Python. For example,
Id Name Date Number
0 1 A 201601 5
1 2 B 201602 6
2 3 C 201603 4
The Date column has the year and month as string. Ideally, my goal is:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
I was able to do this in Excel using EOMONTH and string slicing, but when I tried pd.to_datetime in Python, it didn't work. Thanks!
We can use MonthEnd:
from pandas.tseries.offsets import MonthEnd

df.Date = (pd.to_datetime(df.Date, format='%Y%m')
           + MonthEnd(1)).dt.strftime('%m/%d/%Y')
df
Out[1336]:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
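If downstream steps need real dates rather than strings, the strftime step could be skipped; a sketch:
# assumption: a datetime64 column of month-end timestamps is preferred over formatted strings
df.Date = pd.to_datetime(df.Date, format='%Y%m') + MonthEnd(1)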
You can use PeriodIndex:
In [36]: df['Date'] = pd.PeriodIndex(df['Date'].astype(str), freq='M').strftime('%m/%d/%Y')
In [37]: df
Out[37]:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
