How to join pandas dataframe to itself by condition? - python

I have a Python pandas DataFrame with two relevant columns, "date" and "value". Let's assume it looks like this and is ordered by date:
import pandas as pd

data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28",
                              "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
                     "value": [1, 2, 3, 4, 5, 6, 5, 8]})
data["date"] = pd.to_datetime(data["date"])
Now I want to join the DataFrame to itself in such a way that, for each last available day in a month, I get the next available day where the value is higher. In our example this should basically look like this:
date, value, date2, value2:
2021-01-31, 2, 2021-02-01, 3
2021-02-28, 4, 2021-03-01, 5
2021-03-31, 6, 2021-04-02, 8
2021-04-02, 8, NaN, NaN
My current partial solution to this problem looks like this:
last_days = data.groupby([data.date.dt.year, data.date.dt.month]).last()
res = [data.loc[(data.date>date) & (data.value > value)][:1] for date, value in zip(last_days.date, last_days.value)]
print(res)
But given the advice in the answer "Don't iterate over rows in a dataframe", this doesn't feel like the pandas way to me.
So the question is, how to solve it the pandas way?

If you don’t have too many rows, you could generate all pairs of items and filter from there.
Let’s start with getting the last days in the month:
>>> last = data.loc[data['date'].dt.daysinmonth == data['date'].dt.day]
>>> last
date value
1 2021-01-31 2
3 2021-02-28 4
5 2021-03-31 6
Now use a cross join to map each last day to any possible day, then filter on criteria such as later date and larger value:
>>> pairs = pd.merge(last, data, how='cross', suffixes=('', '2'))
>>> pairs = pairs.loc[pairs['date2'].gt(pairs['date']) & pairs['value2'].gt(pairs['value'])]
>>> pairs
date value date2 value2
2 2021-01-31 2 2021-02-01 3
3 2021-01-31 2 2021-02-28 4
4 2021-01-31 2 2021-03-01 5
5 2021-01-31 2 2021-03-31 6
6 2021-01-31 2 2021-04-01 5
7 2021-01-31 2 2021-04-02 8
12 2021-02-28 4 2021-03-01 5
13 2021-02-28 4 2021-03-31 6
14 2021-02-28 4 2021-04-01 5
15 2021-02-28 4 2021-04-02 8
23 2021-03-31 6 2021-04-02 8
Finally, use GroupBy.idxmin() on date2 to keep the earliest match for each last day:
>>> pairs.loc[pairs.groupby(['date', 'value'])['date2'].idxmin().values]
date value date2 value2
2 2021-01-31 2 2021-02-01 3
12 2021-02-28 4 2021-03-01 5
23 2021-03-31 6 2021-04-02 8
Otherwise you might want apply, which is pretty much the same as iterating on rows to be entirely honest.
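For reference, that apply route could look roughly like the sketch below (the helper name first_higher is made up for illustration); it still scans the full frame once per month-end row, so it is iteration in disguise.
def first_higher(row):
    # find the first later row with a strictly higher value, or NaT/NaN if none exists
    later = data.loc[(data['date'] > row['date']) & (data['value'] > row['value'])]
    if later.empty:
        return pd.Series({'date2': pd.NaT, 'value2': float('nan')})
    return later.iloc[0].rename({'date': 'date2', 'value': 'value2'})

pd.concat([last, last.apply(first_higher, axis=1)], axis=1)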

First create two masks: one for the last day of each month and another one for the first day of the next month.
m1 = data['date'].diff(1).shift(-1) == pd.Timedelta(days=1)
m2 = m1.shift(1, fill_value=False)
Finally, concatenate the 2 results ignoring index:
>>> pd.concat([data.loc[m1].reset_index(drop=True),
...            data.loc[m2].reset_index(drop=True)], axis="columns")
date value date value
0 2021-01-31 2 2021-02-01 3
1 2021-02-28 4 2021-03-01 5
2 2021-03-31 6 2021-04-01 5
3 2021-04-01 5 2021-04-02 8
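A variant of the same pairing can lean on pandas' built-in month-end accessor instead of day differences; like the masks above, this sketch simply pairs each month end with the row right after it (assuming the frame is sorted by date), without checking that the value is actually higher.
m1 = data['date'].dt.is_month_end     # last calendar day of a month
m2 = m1.shift(1, fill_value=False)    # the row immediately after it
pd.concat([data.loc[m1].reset_index(drop=True),
           data.loc[m2].reset_index(drop=True).add_suffix('2')],
          axis='columns')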

One option is conditional_join from pyjanitor, which uses binary search underneath and should be faster and more memory efficient than a cross merge as the data size increases. Also have a look at the piso library to see if it can be helpful or more efficient:
Get the last dates, via a groupby (assumption here is that the data is already sorted; if not, you can sort it before grouping):
# pip install pyjanitor
import pandas as pd
import janitor
trim = (data
        .groupby([data.date.dt.year, data.date.dt.month], as_index=False)
        .nth(-1)
        )
trim
date value
1 2021-01-31 2
3 2021-02-28 4
5 2021-03-31 6
7 2021-04-02 8
Use conditional_join to get rows where the value from trim is less than data, and the date from trim is less than data as well:
trimmed = trim.conditional_join(data,
                                # variable arguments
                                # each tuple is of the form:
                                # (col_from_left_df, col_from_right_df, comparator)
                                ('value', 'value', '<'),
                                ('date', 'date', '<'),
                                how='left')
trimmed
left right
date value date value
0 2021-01-31 2 2021-02-01 3.0
1 2021-01-31 2 2021-02-28 4.0
2 2021-01-31 2 2021-03-01 5.0
3 2021-01-31 2 2021-04-01 5.0
4 2021-01-31 2 2021-03-31 6.0
5 2021-01-31 2 2021-04-02 8.0
6 2021-02-28 4 2021-03-01 5.0
7 2021-02-28 4 2021-04-01 5.0
8 2021-02-28 4 2021-03-31 6.0
9 2021-02-28 4 2021-04-02 8.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN
Since the only interest is in the first match, a groupby is required.
trimmed = (trimmed
           .groupby(('left', 'date'), dropna=False, as_index=False)
           .nth(0)
           )
trimmed
left right
date value date value
0 2021-01-31 2 2021-02-01 3.0
6 2021-02-28 4 2021-03-01 5.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN
You can flatten the columns:
trimmed.set_axis(['date', 'value', 'date2', 'value2'], axis = 'columns')
date value date2 value2
0 2021-01-31 2 2021-02-01 3.0
6 2021-02-28 4 2021-03-01 5.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN

Related

December/January Seasonal Mean

I am attempting to calculate the seasonal means for the winter months of DJF and DJ. I first tried to use Xarray's .groupby function:
ds.groupby('time.month').mean('time')
Then I realized that, instead of grouping the previous year's December with the subsequent Jan/Feb, it was grouping all three months from the same year. I was then able to figure out how to solve for the DJF season by resampling and creating a function to select out the proper 3-month period:
def is_djf(month):
    return month == 12

ds.resample('QS-MAR').mean('time')
ds.sel(time=is_djf(ds['time.month']))
I am still unfortunately unsure how to solve for the Dec./Jan. season since the resampling method I used was for offsetting quarterly. Thank you for any and all help!
Use resample with QS-DEC.
Suppose this dataframe:
time val
0 2020-12-31 1
1 2021-01-31 1
2 2021-02-28 1
3 2021-03-31 2
4 2021-04-30 2
5 2021-05-31 2
6 2021-06-30 3
7 2021-07-31 3
8 2021-08-31 3
9 2021-09-30 4
10 2021-10-31 4
11 2021-11-30 4
12 2021-12-31 5
13 2022-01-31 5
14 2022-02-28 5
>>> df.set_index('time').resample('QS-DEC').mean()
val
time
2020-12-01 1.0
2021-03-01 2.0
2021-06-01 3.0
2021-09-01 4.0
2021-12-01 5.0
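Since the original question uses xarray, the same anchored-quarter frequency should carry over directly; a minimal sketch, assuming ds is an xarray Dataset with a time coordinate named time:
# quarterly means anchored on December, then keep only the quarters that
# actually start in December, i.e. the DJF seasons
seasonal = ds.resample(time='QS-DEC').mean()
djf = seasonal.sel(time=seasonal['time.month'] == 12)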

Pandas fill missing dates and values simultaneously for each group

I have a dataframe (mydf) with dates for each group in monthly frequency like below:
Dt Id Sales
2021-03-01 B 2
2021-04-01 B 42
2021-05-01 B 20
2021-06-01 B 4
2020-10-01 A 47
2020-11-01 A 67
2020-12-01 A 46
I want to fill in the dates for each group, starting from each Id's own first date and running up to the maximum date in the Dt column, while simultaneously filling in 0 for the Sales column. So each group starts at its own start date but ends at the same end date.
So, for example, Id=A will start from 2020-10-01 and go all the way to 2021-06-01, and the value for the filled dates will be 0.
So the output will be
Dt Id Sales
2021-03-01 B 2
2021-04-01 B 42
2021-05-01 B 20
2021-06-01 B 4
2020-10-01 A 47
2020-11-01 A 67
2020-12-01 A 46
2021-01-01 A 0
2021-02-01 A 0
2021-03-01 A 0
2021-04-01 A 0
2021-05-01 A 0
2021-06-01 A 0
I have tried reindex, but instead of hard-coding the date range I want to use the dates from the groups.
My code is :
f = lambda x: x.reindex(pd.date_range('2020-10-01', '2021-06-01', freq='MS', name='Dt'))
mydf = mydf.set_index('Dt').groupby('Id').apply(f).drop('Id', axis=1).fillna(0)
mydf = mydf.reset_index()
Let's try:
Getting the minimum value per group using groupby.min
Add a new column to the aggregated mins called max which stores the maximum values from the frame using Series.max on Dt
Create individual date_range per group based on the min and max values
Series.explode into rows to have a DataFrame that represents the new index.
Create a MultiIndex.from_frame to reindex the DataFrame with.
reindex with midx and set fill_value=0
# Get Min Per Group
dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min')
# Get max from Frame
dates['max'] = mydf['Dt'].max()
# Create MultiIndex with separate Date ranges per Group
midx = pd.MultiIndex.from_frame(
    dates.apply(
        lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1
    ).explode().reset_index(name='Dt')[['Dt', 'Id']]
)
# Reindex
mydf = (
    mydf.set_index(['Dt', 'Id'])
    .reindex(midx, fill_value=0)
    .reset_index()
)
mydf:
Dt Id Sales
0 2020-10-01 A 47
1 2020-11-01 A 67
2 2020-12-01 A 46
3 2021-01-01 A 0
4 2021-02-01 A 0
5 2021-03-01 A 0
6 2021-04-01 A 0
7 2021-05-01 A 0
8 2021-06-01 A 0
9 2021-03-01 B 2
10 2021-04-01 B 42
11 2021-05-01 B 20
12 2021-06-01 B 4
DataFrame:
import pandas as pd
mydf = pd.DataFrame({
    'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01', '2020-10-01',
           '2020-11-01', '2020-12-01'],
    'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Sales': [2, 42, 20, 4, 47, 67, 46]
})
mydf['Dt'] = pd.to_datetime(mydf['Dt'])
An alternative using pd.MultiIndex with list comprehension:
s = pd.MultiIndex.from_tuples([[x, d]
                               for x, y in df.groupby("Id")["Dt"]
                               for d in pd.date_range(min(y), max(df["Dt"]), freq="MS")],
                              names=["Id", "Dt"])
print(df.set_index(["Id", "Dt"]).reindex(s, fill_value=0).reset_index())
Here is a different approach:
from itertools import product

# compute the min-max date range
date_range = pd.date_range(*mydf['Dt'].agg(['min', 'max']), freq='MS', name='Dt')

# make a MultiIndex per group, keeping only dates above each group's min date
idx = pd.MultiIndex.from_tuples([e for Id, Dt_min in mydf.groupby('Id')['Dt'].min().items()
                                 for e in product(date_range[date_range > Dt_min], [Id])])

# concatenate the original dataframe and the missing indexes
mydf = mydf.set_index(['Dt', 'Id'])
mydf = pd.concat([mydf,
                  mydf.reindex(idx.difference(mydf.index)).fillna(0)]
                 ).sort_index(level=1).reset_index()
mydf
output:
Dt Id Sales
0 2020-10-01 A 47.0
1 2020-11-01 A 67.0
2 2020-12-01 A 46.0
3 2021-01-01 A 0.0
4 2021-02-01 A 0.0
5 2021-03-01 A 0.0
6 2021-04-01 A 0.0
7 2021-05-01 A 0.0
8 2021-06-01 A 0.0
9 2021-03-01 B 2.0
10 2021-04-01 B 42.0
11 2021-05-01 B 20.0
12 2021-06-01 B 4.0
We can use the complete function from pyjanitor to expose the missing values:
Convert Dt to datetime:
df['Dt'] = pd.to_datetime(df['Dt'])
Create a mapping of Dt to new values, via pd.date_range, and set the frequency to monthly begin (MS):
# pip install pyjanitor
import janitor
import pandas as pd

max_time = df.Dt.max()
new_values = {"Dt": lambda df: pd.date_range(df.min(), max_time, freq='1MS')}
df.complete([new_values], by='Id').fillna(0)
Id Dt Sales
0 A 2020-10-01 47.0
1 A 2020-11-01 67.0
2 A 2020-12-01 46.0
3 A 2021-01-01 0.0
4 A 2021-02-01 0.0
5 A 2021-03-01 0.0
6 A 2021-04-01 0.0
7 A 2021-05-01 0.0
8 A 2021-06-01 0.0
9 B 2021-03-01 2.0
10 B 2021-04-01 42.0
11 B 2021-05-01 20.0
12 B 2021-06-01 4.0
Sticking to pandas only, we can combine apply with groupby and reindex; thankfully, Dt is unique within each group, so we can safely reindex:
(df
 .set_index('Dt')
 .groupby('Id')
 .apply(lambda df: df.reindex(pd.date_range(df.index.min(),
                                            max_time,
                                            freq='1MS'),
                              fill_value=0))
 .drop(columns='Id')
 .rename_axis(['Id', 'Dt'])
 .reset_index())
Id Dt Sales
0 A 2020-10-01 47
1 A 2020-11-01 67
2 A 2020-12-01 46
3 A 2021-01-01 0
4 A 2021-02-01 0
5 A 2021-03-01 0
6 A 2021-04-01 0
7 A 2021-05-01 0
8 A 2021-06-01 0
9 B 2021-03-01 2
10 B 2021-04-01 42
11 B 2021-05-01 20
12 B 2021-06-01 4

How would I get the values from a previously assigned index

I have two data frames; the first column is formed by taking the index values from the other data frame. This is tested and successfully returns 5 entries.
The second line executes, but assigns NaN to all rows in the "StartPrice" column:
df = pd.DataFrame()
df["StartBar"] = df_rs["HighTrendStart"].dropna().index # Works
df["StartPrice"] = df_rs["HighTrendStart"].loc[df["StartBar"]] # Assigns Nan's to all rows
As pointed out by @YOBEN_S, the indexes do not match.
Date
2020-05-01 00:00:00 NaN
2020-05-01 00:15:00 NaN
2020-05-01 00:30:00 NaN
2020-05-01 00:45:00 NaN
2020-05-01 01:00:00 NaN
Freq: 15T, Name: HighTrendStart, dtype: float64
0 2020-05-01 02:30:00
1 2020-05-01 06:30:00
2 2020-05-01 13:45:00
3 2020-05-01 16:15:00
4 2020-05-01 20:00:00
Name: StartBar, dtype: datetime64[ns]
The indexes do not match when you assign values from a different DataFrame, so convert to a NumPy array to assign by position:
df["StartPrice"] = df_rs["HighTrendStart"].loc[df["StartBar"]].to_numpy()
For example
df=pd.DataFrame({'a':[1,2,3,4,5,6]})
s=pd.Series([1,2,3,4,5,6],index=list('abcdef'))
df
Out[190]:
a
0 1
1 2
2 3
3 4
4 5
5 6
s
Out[191]:
a 1
b 2
c 3
d 4
e 5
f 6
dtype: int64
df['New']=s
df
Out[193]:
a New
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
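Converting the Series to a NumPy array first, as in the fix above, assigns by position and avoids the NaNs:
df['New'] = s.to_numpy()
df
   a  New
0  1    1
1  2    2
2  3    3
3  4    4
4  5    5
5  6    6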

Counting Number of Occurrences Between Dates (Given an ID value) From Another Dataframe

Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.
Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.
df_emails includes the date we sent a customer an email and it looks like this:
CustID DateSent
0 2 2018-01-20
1 2 2018-02-19
2 2 2018-03-31
3 4 2018-01-10
4 4 2018-02-26
5 5 2018-02-01
6 5 2018-02-07
df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:
CustID TripDate TotalSpend
0 2 2018-02-04 25
1 2 2018-02-16 100
2 2 2018-02-22 250
3 4 2018-01-03 50
4 4 2018-02-28 100
5 4 2018-03-21 100
6 8 2018-01-07 200
Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:
CustID DateSent NextDateSentOrEndOfData TripsBetween TotalSpendBetween
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 2018-04-01 0.0 0.0
3 4 2018-01-10 2018-02-26 0.0 0.0
4 4 2018-02-26 2018-04-01 2.0 200.0
5 5 2018-02-01 2018-02-07 0.0 0.0
6 5 2018-02-07 2018-04-01 0.0 0.0
Though I have tried my best to do this in a Python/Pandas friendly way, the only accurate solution I have been able to implement is through an np.where, shifting, and looping. The solution looks like this:
df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1
df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')
df_emails["NextDateSentOrEndOfData"] = pd.to_datetime(df_emails["DateSent"].shift(-1)).where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"], pd.to_datetime('04-01-2018'))
for i in df_emails.index:
df_emails.at[i, "TripsBetween"] = len(df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i,"DateSent"]) & (df_trips["TripDate"] < df_emails.at[i,"NextDateSentOrEndOfData"])])
for i in df_emails.index:
df_emails.at[i, "TotalSpendBetween"] = df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i,"DateSent"]) & (df_trips["TripDate"] < df_emails.at[i,"NextDateSentOrEndOfData"])].TotalSpend.sum()
df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)
However, a %%timeit has revealed that this takes 10.6ms on just the seven rows shown above, which makes this solution pretty much infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a solution here that is faster and thus feasible?
Add the next date column to emails
df_emails["NextDateSent"] = df_emails.groupby("CustID").shift(-1)
Sort for merge_asof, then merge each trip to the nearest preceding email to create a trip lookup table:
df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID", left_on="TripDate",right_on="DateSent", direction="backward")
Aggregate the lookup table for the data you want.
df_lookup = df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]].groupby(["CustID", "DateSent"]).agg(["count","sum"])
Left join it back to the email table.
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
I chose to leave the NaNs as NaNs because I don't like filling in default values (you can always do that later if you prefer, but you can't easily distinguish things that existed from things that didn't if you put defaults in early).
CustID DateSent NextDateSent (TotalSpend, count) (TotalSpend, sum)
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 NaT NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 NaT 2.0 200.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 NaT NaN NaN
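If you do want the question's defaults (the 2018-04-01 end-of-data date and zeros instead of NaN), a small cleanup step along these lines could follow; this is a sketch that assumes the column names produced by the join above.
end_of_data = pd.Timestamp("2018-04-01")
df_merge["NextDateSent"] = df_merge["NextDateSent"].fillna(end_of_data)
df_merge = df_merge.fillna(0)  # intervals with no trips get a count and sum of 0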
This would be an easy case for merge_asof if I were able to handle the max_date, so instead I go the long way:
max_date = pd.to_datetime('2018-04-01')
# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)
# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID').shift(-1).fillna(max_date)
# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins = list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)
# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)
# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
                  .TotalSpend.agg(['sum', 'size'])
                  .reset_index()
          )
# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)
# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)
# merge:
df_emails.reset_index().merge(new_df,
                              on=['CustID', 'NextDateSentOrEndOfData'],
                              how='left')
Output:
CustID DateSent NextDateSentOrEndOfData sum size
0 2 2018-01-20 2018-02-19 125.0 2.0
1 2 2018-02-19 2018-03-31 250.0 1.0
2 2 2018-03-31 2018-04-01 NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 2018-04-01 200.0 2.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 2018-04-01 NaN NaN

Find a one year date from now in Python/Pandas with approximations

I have a pandas DataFrame with a single column 'Price' and dates as the index. I want to add a new column called 'Approx' containing
approx. = price of today - price of one year ago (or the closest available date to a year ago) - price in one year (again, take the closest available date if the exact one-year price doesn't exist)
For example:
approx. for 2019-04-30 = 8 - 4 - 10 = -6, i.e. price(2019-04-30) - price(2018-01-31) - price(2020-07-30)
To be honest I am struggling a bit with that...
ex. [in]: Price
2018-01-31 4
2019-04-30 8
2020-07-30 10
2020-10-31 9
2021-01-31 14
2021-04-30 150
2021-07-30 20
2022-10-31 14
[out]: Price aprox.
2018-01-31 4
2019-04-30 8 -6 ((8-4-10) = -6) since there is no 2018-04-30
2020-07-30 10 -12 (10-14-8)
2020-10-31 9 ...
2021-01-31 14 ...
2021-04-30 150
2021-07-30 20
2022-10-31 14
I am struggling very much with that... even more with the approximation.
Thank you very much!!
It's not quite clear to me what you are trying to do, but maybe this is what you want:
import pandas
def last_year(x):
    """
    Return date from a year ago.
    """
    return x - pandas.DateOffset(years=1)

# Simulate the data you provided in the example
dt_str = ['2018-01-31', '2019-04-30', '2020-07-30', '2020-10-31',
          '2021-01-31', '2021-04-30', '2021-07-30', '2022-10-31']
dates = [pandas.Timestamp(x) for x in dt_str]
df = pandas.DataFrame([4, 8, 10, 9, 14, 150, 20, 14], columns=['Price'], index=dates)
# This is the code that does the work
for dt, value in df['Price'].items():
    df.loc[dt, 'approx'] = value - df['Price'].asof(last_year(dt))
This gave me the following results:
In [147]: df
Out[147]:
Price approx
2018-01-31 4 NaN
2019-04-30 8 4.0
2020-07-30 10 2.0
2020-10-31 9 1.0
2021-01-31 14 6.0
2021-04-30 150 142.0
2021-07-30 20 10.0
2022-10-31 14 -6.0
The bottom line is that for this type of operation you can't just use the apply operation since you need both the index and the value.
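If the loop becomes a bottleneck on a large frame, the same lookup can be vectorized, since Series.asof also accepts an array of dates; a minimal sketch, assuming the index is sorted:
# look up "price a year ago" for every row at once, then subtract
prev_year = df.index - pandas.DateOffset(years=1)
df['approx'] = df['Price'].to_numpy() - df['Price'].asof(prev_year).to_numpy()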
