I have two dataframes and for one I want to find the closest (previous) date in the other.
If the date matches exactly, then I need to take the previous date instead.
df_main contains the reference information
For df_sample I want to look up the Time in df_main for the closest (but previous) entry. I can do this using method='ffill', but where the date for the Time field is the same day it returns that day; I want it to return the previous one, basically a < rather than a <=.
In my example df_res I want the closest_val column to contain [ "n/a", 90, 90, 280, 280, 280]
import pandas as pd
dsample = {'Index': [1, 2, 3, 4, 5, 6],
'Time': ["2020-06-01", "2020-06-02", "2020-06-03", "2020-06-04" ,"2020-06-05" ,"2020-06-06"],
'Pred': [100, -200, 300, -400 , -500, 600]
}
dmain = {'Index': [1, 2, 3],
'Time': ["2020-06-01", "2020-06-03","2020-06-06"],
'Actual': [90, 280, 650]
}
def find_closest(x, df2):
    df_res = df2.iloc[df2.index.get_loc(x['Time'], method='ffill')]
    x['closest_time'] = df_res['Time']
    x['closest_val'] = df_res['Actual']
    return x
df_sample = pd.DataFrame(data=dsample)
df_main = pd.DataFrame(data=dmain)
df_sample = df_sample.set_index(pd.DatetimeIndex(df_sample['Time']))
df_main = df_main.set_index(pd.DatetimeIndex(df_main['Time']))
df_res = df_sample.apply(find_closest, df2=df_main ,axis=1)
Use pd.merge_asof (make sure 'Time' is indeed a datetime):
pd.merge_asof(df_sample, df_main, left_on="Time", right_on="Time", allow_exact_matches=False)
The output is:
Index_x Time Pred Index_y Actual
0 1 2020-06-01 100 NaN NaN
1 2 2020-06-02 -200 1.0 90.0
2 3 2020-06-03 300 1.0 90.0
3 4 2020-06-04 -400 2.0 280.0
4 5 2020-06-05 -500 2.0 280.0
5 6 2020-06-06 600 2.0 280.0
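For completeness, a minimal sketch of the full call on the dataframes built above, assuming the Time columns are first converted with pd.to_datetime, and renaming Actual to match the desired closest_val column:

import pandas as pd

df_sample["Time"] = pd.to_datetime(df_sample["Time"])
df_main["Time"] = pd.to_datetime(df_main["Time"])

df_res = (
    pd.merge_asof(df_sample, df_main, on="Time", allow_exact_matches=False)
      .rename(columns={"Actual": "closest_val"})
)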
IIUC, we can do a Cartesian product of both your dataframes, then filter out the exact matches, then apply some logic to figure out the closest date.
Finally, we will join your exact and non-exact matches into a final dataframe.
import numpy as np

# Ensure the Time columns are datetimes; the question builds them as strings,
# and the date arithmetic below needs real timestamps.
df_sample['Time'] = pd.to_datetime(df_sample['Time'])
df_main['Time'] = pd.to_datetime(df_main['Time'])

s = pd.merge(
    df_sample.assign(key="var1"),
    df_main.assign(key="var1").rename(columns={"Time": "TimeDelta"}).drop("Index", axis=1),
    on="key",
    how="outer",
).drop("key", axis=1)
exact_matches = s[s['Time'].eq(s['TimeDelta'])]
non_exact_matches_cart = s[~s['Time'].isin(exact_matches['Time'])]
non_exact_matches = non_exact_matches_cart.assign(
    delta=(non_exact_matches_cart["Time"] - non_exact_matches_cart["TimeDelta"])
    / np.timedelta64(1, "D")
).query("delta >= 0").sort_values(["Time", "delta"]).drop_duplicates(
    "Time", keep="first"
).drop('delta', axis=1)
A lot to take in with the variable above, but essentially we are finding the difference in time, removing any difference that goes into the future, and dropping duplicates so that only the closest date in the past is kept.
df = pd.concat([exact_matches, non_exact_matches], axis=0).sort_values("Time").rename(
    columns={"TimeDelta": "closest_time", "Actual": "closest val"}
)
print(df)
Index Time Pred closest_time closest val
0 1 2020-06-01 100 2020-06-01 90
3 2 2020-06-02 -200 2020-06-01 90
7 3 2020-06-03 300 2020-06-03 280
10 4 2020-06-04 -400 2020-06-03 280
13 5 2020-06-05 -500 2020-06-03 280
17 6 2020-06-06 600 2020-06-06 650
I have the example dataframe below. I created a function that does what I want, computing a Sales rolling average (7- and 14-day windows) for each Store for the previous day and shifting it to the current date. How can I compute this only for a specific date, for example 2022-12-31? I have a lot of rows and I don't want to recalculate everything each time I add a date.
import numpy as np
import pandas as pd
ex = pd.DataFrame({'Date':pd.date_range('2022-10-01', '2022-12-31'),
'Store': np.random.choice(2, len(pd.date_range('2022-10-01', '2022-12-31'))),
'Sales': np.random.choice(10000, len(pd.date_range('2022-10-01', '2022-12-31')))})
ex.sort_values(['Store','Date'], ascending=False, inplace=True)
for days in [7, 14]:
    ex['Sales_mean_' + str(days) + '_days'] = ex.groupby('Store')[['Sales']].apply(lambda x: x.shift(-1).rolling(days).mean().shift(-days+1))
I redefined a similar dataframe, because using a random generator makes debugging difficult: the dataframe changes at every test.
In addition, to keep it simple, I will use moving-average periods of 2 and 3.
Starting dataframe
Date Store Sales
9 2022-10-10 1 5347
8 2022-10-09 1 1561
7 2022-10-08 1 5648
6 2022-10-07 1 8123
5 2022-10-06 1 1401
4 2022-10-05 0 2745
3 2022-10-04 0 7848
2 2022-10-03 0 3151
1 2022-10-02 0 4296
0 2022-10-01 0 9028
It is given by:
ex = pd.DataFrame({
"Date": pd.date_range('2022-10-01', '2022-10-10'),
"Store": [0]*5+[1]*5,
"Sales": [9028, 4296, 3151, 7848, 2745, 1401, 8123, 5648, 1561, 5347],
})
ex.sort_values(['Store','Date'], ascending=False, inplace=True)
Proposed code
import pandas as pd
import numpy as np
ex = pd.DataFrame({
"Date": pd.date_range('2022-10-01', '2022-10-10'),
"Store": [0]*5+[1]*5,
"Sales": [9028, 4296, 3151, 7848, 2745, 1401, 8123, 5648, 1561, 5347],
})
ex.sort_values(['Store','Date'], ascending=False, inplace=True)
periods=(2,3)
### STEP 1 -- Initialization : exhaustive Mean() Calculation
for per in periods:
    ex["Sales_mean_{0}_days".format(per)] = (
        ex.groupby(['Store'])['Sales']
        .apply(lambda g: g.shift(-1)
                          .rolling(per)
                          .mean()
                          .shift(-per+1))
    )
### STEP 2 -- New Row Insertion
def fmt_newRow(g, newRow, periods):
    return {
        "Date": pd.Timestamp(newRow[0]),
        "Store": newRow[1],
        "Sales": newRow[2],
        "Sales_mean_{0}_days".format(periods[0]): g['Sales'].iloc[0:periods[0]].mean(),
        "Sales_mean_{0}_days".format(periods[1]): g['Sales'].iloc[0:periods[1]].mean(),
    }
def add2DF(ex, newRow):
    # g : sub-Store group
    g = (
        ex.loc[ex.Store==newRow[1]]
        .sort_values(['Store','Date'], ascending=False)
    )
    # Append newRow as a dictionary and sort by ['Store','Date']
    ex = (
        ex.append(fmt_newRow(g, newRow, periods), ignore_index=True)
        .sort_values(['Store','Date'], ascending=False)
        .reset_index(drop=True)
    )
    return ex
newRow = ['2022-10-11', 1, 2803] # [Date, Store, Sales]
ex = add2DF(ex, newRow)
print(ex)
Result
Date Store Sales Sales_mean_2_days Sales_mean_3_days
0 2022-10-11 1 2803 3454.0 4185.333333
1 2022-10-10 1 5347 3604.5 5110.666667
2 2022-10-09 1 1561 6885.5 5057.333333
3 2022-10-08 1 5648 4762.0 NaN
4 2022-10-07 1 8123 NaN NaN
5 2022-10-06 1 1401 NaN NaN
6 2022-10-05 0 2745 5499.5 5098.333333
7 2022-10-04 0 7848 3723.5 5491.666667
8 2022-10-03 0 3151 6662.0 NaN
9 2022-10-02 0 4296 NaN NaN
10 2022-10-01 0 9028 NaN NaN
Comments
A new row is a list like this one: [Date, Store, Sales]
Each time you need to add a new row to the dataframe, you pass it to the fmt_newRow function together with the corresponding subgroup g
fmt_newRow returns the new row in the form of a dictionary, which is inserted into the dataframe with the pandas append function
There is no need to recalculate all the averages, because only the last per values of g are used to compute the new row's averages
Moving averages for periods 2 and 3 were checked and are correct.
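One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. Below is a minimal sketch of the same insertion using pd.concat instead, reusing the fmt_newRow helper and periods defined above (add2DF_concat is a hypothetical drop-in for add2DF):

import pandas as pd

def add2DF_concat(ex, newRow):
    # Sub-group for this row's Store, most recent dates first
    g = ex.loc[ex.Store == newRow[1]].sort_values(['Store', 'Date'], ascending=False)
    # Build the new row as a one-row frame and concatenate it
    row = pd.DataFrame([fmt_newRow(g, newRow, periods)])
    return (
        pd.concat([ex, row], ignore_index=True)
        .sort_values(['Store', 'Date'], ascending=False)
        .reset_index(drop=True)
    )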
I am trying to use groupby to group by symbol and return the average of prior high volume days using pandas.
I create my data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"date": ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
"symbol": ['ABC', 'ABC', 'ABC', 'AAA', 'AAA', 'AAA'],
"change": [20, 1, 2, 3, 50, 100],
"volume": [20000000, 100, 3000, 500, 40000000, 60000000],
})
Filter by high volume and change:
high_volume_days = df[(df['volume'] >= 20000000) & (df['change'] >= 20)]
Then I get the previous high-volume day's volume (this works):
high_volume_days['previous_high_volume_day'] = high_volume_days.groupby('symbol')['volume'].shift(1)
But when I try to calculate the average of all the days per symbol:
high_volume_days['avg_volume_prior_days'] = df.groupby('symbol')['volume'].mean()
I am getting NaNs:
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN NaN
4 2022-01-05 AAA 50 40000000 NaN NaN
5 2022-01-06 AAA 100 60000000 40000000.0 NaN
What am I missing here?
Desired output:
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN 20000000
4 2022-01-05 AAA 50 40000000 NaN 40000000
5 2022-01-06 AAA 100 60000000 40000000.0 50000000
high_volume_days['avg_volume_prior_days'] = high_volume_days.groupby('symbol', sort=False)['volume'].expanding().mean().droplevel(0)
high_volume_days
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN 20000000.0
4 2022-01-05 AAA 50 40000000 NaN 40000000.0
5 2022-01-06 AAA 100 60000000 40000000.0 50000000.0
Index misalignment: high_volume_days is indexed by integers. The df.groupby(...) is indexed by the symbol.
Use merge instead:
high_volume_days = pd.merge(
high_volume_days,
df.groupby("symbol")["volume"].mean().rename("avg_volume_prior_days"),
left_on="symbol",
right_index=True,
)
df.groupby('symbol')['volume'].mean() returns:
symbol
AAA 33333500.0
ABC 6667700.0
Name: volume, dtype: float64
which is an aggregation of each group to a single value. Note that the groups (symbol) are the index of this series. When you try to assign it back to high_volume_days, there is an index misalignment.
Instead of an aggregation (.mean() is equivalent to .agg("mean")), you should use a transformation: .transform("mean").
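For example, a minimal sketch of the transform version: the transform result is indexed like df, so the assignment picks out the filtered rows correctly; the value is the per-symbol mean over all of df, the same as the merge above, and the assignment may raise a SettingWithCopyWarning because high_volume_days is a filtered slice.

high_volume_days['avg_volume_prior_days'] = (
    df.groupby('symbol')['volume'].transform('mean')
)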
==== EDIT ====
Instead of the mean for all values, you're looking for the mean "thus far". You can typically do that using .expanding().mean(), but since you're reassigning back to a column in high_volume_days, you need to either drop the level that contains the symbols, or use a lambda:
high_volume_days.groupby('symbol')['volume'].expanding().mean().droplevel(0)
# or
high_volume_days.groupby('symbol')['volume'].transform(lambda x: x.expanding().mean())
I have a calendar dataframe as follows.
calendar = pd.DataFrame({"events": ["e1", "e2", "e3"],
"date_start": ["2021-02-01", "2021-02-06", "2021-02-03"],
"date_end":["2021-02-04", "2021-02-07", "2021-02-03"],
"country": ["us", "us", "uk"]})
calendar["date_start"] = pd.to_datetime(calendar["date_start"])
calendar["date_end"] = pd.to_datetime(calendar["date_end"])
and I have a daily dataframe as follows.
daily = pd.DataFrame({"date": pd.date_range(start="2021-02-01", end="2021-02-08"),
"value":[10, 20, 30, 40, 50, 60, 70, 80]})
I would like to take only the events from the US and join them to the daily dataframe, but the joining conditions are (date >= date_start) and (date <= date_end). So the expected output looks like this:
date value events
2021-02-01 10 e1
2021-02-02 20 e1
2021-02-03 30 e1
2021-02-04 40 e1
2021-02-05 50
2021-02-06 60 e2
2021-02-07 70 e2
2021-02-08 80
I can do this with a loop, but it is not efficient. Do you have any suggestions for a better way to do it?
Use df.merge:
# Do a cross-join on the `tmp` column
In [2279]: x = calendar.assign(tmp=1).merge(daily.assign(tmp=1))
# Filter rows by providing your conditions
In [2284]: x = x[x.date.between(x.date_start, x.date_end) & x.country.eq('us')]
# Left-join with `daily` df to get all rows
In [2289]: ans = daily.merge(x[['date', 'events']], on='date', how='left')
In [2290]: ans
Out[2290]:
date value events
0 2021-02-01 10 e1
1 2021-02-02 20 e1
2 2021-02-03 30 e1
3 2021-02-04 40 e1
4 2021-02-05 50 NaN
5 2021-02-06 60 e2
6 2021-02-07 70 e2
7 2021-02-08 80 NaN
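As a side note, on pandas 1.2 or later the cross join no longer needs the helper tmp column; a minimal sketch of the same first step:

x = calendar.merge(daily, how='cross')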
Here is a possible answer to your question.
import numpy as np
import pandas as pd
data_temp_1 = pd.merge(daily,calendar,how='cross')
data_temp_2 = data_temp_1.query('country=="us"')
indices = np.where((data_temp_2['date'] >= data_temp_2['date_start']) & (data_temp_2['date'] <= data_temp_2['date_end']),True,False)
final_df = data_temp_2[indices]
final_df.reset_index(drop=True,inplace=True)
To get the expected df we can use:
expected_df = pd.merge(daily,final_df,how='left')[['date','value','events']]
You can first explode the calendar and then merge on days:
calendar['date'] = [pd.date_range(s, e, freq='d') for s, e in
zip(calendar['date_start'], calendar['date_end'])]
calendar = calendar.explode('date').drop(['date_start', 'date_end'], axis=1)
events = calendar.merge(daily, how='inner', on='date')
us_events = events[events.country == 'us'].drop('country', axis=1)[['date', 'value', 'events']]
I think it is faster than the other answers provided (no apply).
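If you also need the days without a US event (2021-02-05 and 2021-02-08 in the expected output), a left merge from daily keeps them; a sketch reusing the exploded calendar from above:

us_events = (
    daily.merge(
        calendar.loc[calendar.country == 'us'].drop(columns='country'),
        how='left',
        on='date',
    )
    .loc[:, ['date', 'value', 'events']]
)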
One option for a non-equi join is the conditional_join from pyjanitor; underneath the hood it uses binary search to avoid a cartesian product; this can be helpful, depending on the data size:
# pip install pyjanitor
import janitor
import pandas as pd
(
daily
.conditional_join(
calendar,
("date", "date_start", ">="),
("date", "date_end", "<="),
how="left")
.loc[:, ['date', 'value', 'events']]
)
date value events
0 2021-02-01 10 e1
1 2021-02-02 20 e1
2 2021-02-03 30 e1
3 2021-02-03 30 e3
4 2021-02-04 40 e1
5 2021-02-05 50 NaN
6 2021-02-06 60 e2
7 2021-02-07 70 e2
8 2021-02-08 80 NaN
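To restrict the join to US events as the question asks, the calendar can be filtered before the conditional join; a sketch assuming the same conditional_join API as above:

(
    daily
    .conditional_join(
        calendar.loc[calendar.country == 'us'],
        ("date", "date_start", ">="),
        ("date", "date_end", "<="),
        how="left")
    .loc[:, ['date', 'value', 'events']]
)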
I could use some more help with a project. I am trying to analyze 4.5 million rows of data. I have read the data into a dataframe and organized it, and I now have 3 columns: 1) date as datetime, 2) unique identifier, 3) price.
I need to calculate the year-over-year change in price per item, but the dates are not uniform and not consistent per item. For example:
date item price
12/31/15 A 110
12/31/15 B 120
12/31/14 A 100
6/24/13 B 100
What I would like to find as a result is:
date item price previousdate % change
12/31/15 A 110 12/31/14 10%
12/31/15 B 120 6/24/13 20%
12/31/14 A 100
6/24/13 B 100
EDIT - Better example of data
date item price
6/1/2016 A 276.3457646
6/1/2016 B 5.044165645
4/27/2016 B 4.91300186
4/27/2016 A 276.4329163
4/20/2016 A 276.9991265
4/20/2016 B 4.801263717
4/13/2016 A 276.1950213
4/13/2016 B 5.582923328
4/6/2016 B 5.017863509
4/6/2016 A 276.218649
3/30/2016 B 4.64274783
3/30/2016 A 276.554653
3/23/2016 B 5.576438253
3/23/2016 A 276.3135836
3/16/2016 B 5.394435443
3/16/2016 A 276.4222986
3/9/2016 A 276.8929462
3/9/2016 B 4.999951262
3/2/2016 B 4.731349423
3/2/2016 A 276.3972068
1/27/2016 A 276.8458971
1/27/2016 B 4.993033132
1/20/2016 B 5.250379701
1/20/2016 A 276.2899864
1/13/2016 B 5.146639666
1/13/2016 A 276.7041978
1/6/2016 B 5.328296958
1/6/2016 A 276.9465891
12/30/2015 B 5.312301356
12/30/2015 A 256.259668
12/23/2015 B 5.279105491
12/23/2015 A 255.8411198
12/16/2015 B 5.150798234
12/16/2015 A 255.8360529
12/9/2015 A 255.4915183
12/9/2015 B 4.722876886
12/2/2015 A 256.267146
12/2/2015 B 5.083626167
10/28/2015 B 4.876177757
10/28/2015 A 255.6464653
10/21/2015 B 4.551439655
10/21/2015 A 256.1735769
10/14/2015 A 255.9752668
10/14/2015 B 4.693967392
10/7/2015 B 4.911797443
10/7/2015 A 256.2556707
9/30/2015 B 4.262994526
9/30/2015 A 255.8068691
7/1/2015 A 255.7312385
4/22/2015 A 234.6210132
4/15/2015 A 235.3902076
4/15/2015 B 4.154926102
4/1/2015 A 234.4713827
2/25/2015 A 235.1391496
2/18/2015 A 235.1223471
What I have done (with some help from other users) hasn't worked, but it is below. Thanks for any help you can provide or for pointing me in the right direction!
import pandas as pd
import datetime as dt
import numpy as np
df = pd.read_csv('...python test file5.csv',parse_dates =['As of Date'])
df = df[['item','price','As of Date']]
def get_prev_year_price(x, df):
    try:
        return df.loc[x['prev_year_date'], 'price']
        #return np.abs(df.time - x)
    except Exception as e:
        return x['price']

#Function to determine the closest date from given date and list of all dates
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

df['As of Date'] = pd.to_datetime(df['As of Date'], format='%m/%d/%Y')
df = df.rename(columns={df.columns[2]: 'date'})

# list of dates
dtlst = [item for item in df['date']]

data = []
data2 = []
for item in df['item'].unique():
    item_df = df[df['item'] == item]  #select based on items
    select_dates = item_df['date'].unique()
    item_df.set_index('date', inplace=True)  #set date as key index
    item_df = item_df.resample('D').mean().reset_index()  #fill in missing date
    item_df['price'] = item_df['price'].interpolate('nearest')  #fill in price with nearest price available
    # use max(item_df['date'] where item_df['date'] < item_df['date'] - pd.DateOffset(years=1, days=1))
    #possible_date = item_df['date'] - pd.DateOffset(years=1)
    #item_df['prev_year_date'] = max(df[df['date'] <= possible_date])
    item_df['prev_year_date'] = item_df['date'] - pd.DateOffset(years=1)  #calculate 1 year ago date
    date_df = item_df[item_df.date.isin(select_dates)]  #select dates with useful data
    item_df.set_index('date', inplace=True)
    date_df['prev_year_price'] = date_df.apply(lambda x: get_prev_year_price(x, item_df), axis=1)
    #date_df['prev_year_price'] = date_df.apply(lambda x: nearest(dtlst, x), axis=1)
    date_df['change'] = date_df['price'] / date_df['prev_year_price'] - 1
    date_df['item'] = item
    data.append(date_df)
    data2.append(item_df)
summary = pd.concat(data).sort_values('date', ascending=False)
#print (summary)
#saving the output of the CSV file to see how data looks after being handled
filename = '...python_test_file_save4.csv'
summary.to_csv(filename, index=True, encoding='utf-8')
With the current assumptions about the data, this works out for this specific use case (note that price.diff() gives the absolute difference, which happens to equal the percent change here because the base price is 100):
In [2459]: def change(grp):
...: grp['% change'] = grp.price.diff()
...: grp['previousdate'] = grp.date.shift(1)
...: return grp
Sort on date then groupby and apply the change function, then sort the index back.
In [2460]: df.sort_values('date').groupby('item').apply(change).sort_index()
Out[2460]:
date item price % change previousdate
0 2015-12-31 A 110 10.0 2014-12-31
1 2015-12-31 B 120 20.0 2013-06-24
2 2014-12-31 A 100 NaN NaT
3 2013-06-24 B 100 NaN NaT
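If the base prices were not exactly 100, a true percent change could be computed with pct_change inside the same sort/groupby pattern; a minimal sketch of that variation:

def pct(grp):
    grp['% change'] = grp.price.pct_change() * 100
    grp['previousdate'] = grp.date.shift(1)
    return grp

df.sort_values('date').groupby('item').apply(pct).sort_index()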
This is a good situation for merge_asof, which merges two dataframes by finding, for each row of the left dataframe, the last row of the right dataframe whose key is on or before the left key. We need to add a year to the right dataframe's dates first, since the requirement is a difference of one year or more between dates.
Here is some sample data that you brought up in your comment.
date item price
12/31/15 A 110
12/31/15 B 120
12/31/14 A 100
6/24/13 B 100
12/31/15 C 100
1/31/15 C 80
11/14/14 C 130
11/19/13 C 110
11/14/13 C 200
The dates need to be sorted for merge_asof to work. merge_asof keeps only one copy of the join key (the left dataframe's), so we put a copy of the right dataframe's dates back in as a separate previousdate column.
Setup dataframes
df = df.sort_values('date')
df_copy = df.copy()
df_copy['previousdate'] = df_copy['date']
df_copy['date'] += pd.DateOffset(years=1)
Use merge_asof
df_final = pd.merge_asof(df, df_copy,
on='date',
by='item',
suffixes=['current', 'previous'])
df_final['% change'] = (df_final['pricecurrent'] - df_final['priceprevious']) / df_final['priceprevious']
df_final
date item pricecurrent priceprevious previousdate % change
0 2013-06-24 B 100 NaN NaT NaN
1 2013-11-14 C 200 NaN NaT NaN
2 2013-11-19 C 110 NaN NaT NaN
3 2014-11-14 C 130 200.0 2013-11-14 -0.350000
4 2014-12-31 A 100 NaN NaT NaN
5 2015-01-31 C 80 110.0 2013-11-19 -0.272727
6 2015-12-31 A 110 100.0 2014-12-31 0.100000
7 2015-12-31 B 120 100.0 2013-06-24 0.200000
8 2015-12-31 C 100 130.0 2014-11-14 -0.230769
Assume that I have the following data set
import pandas as pd, numpy, datetime
start, end = datetime.datetime(2015, 1, 1), datetime.datetime(2015, 12, 31)
date_list = pd.date_range(start, end, freq='B')
numdays = len(date_list)
value = numpy.random.normal(loc=1e3, scale=50, size=numdays)
ids = numpy.repeat([1], numdays)
test_df = pd.DataFrame({'Id': ids,
'Date': date_list,
'Value': value})
I would now like to calculate the maximum within each business quarter for test_df. One possibility is to use resample with rule='BQ', how='max'. However, I'd like to keep the structure of the dataframe and just generate another column with the maximum for each business quarter. Do you have any suggestions on how to do this?
I think the following should work for you: this groups on the quarter and calls transform on the 'Value' column, returning the maximum value as a Series with its index aligned to the original df:
In [26]:
test_df['max'] = test_df.groupby(test_df['Date'].dt.quarter)['Value'].transform('max')
test_df
Out[26]:
Date Id Value max
0 2015-01-01 1 1005.498555 1100.197059
1 2015-01-02 1 1032.235987 1100.197059
2 2015-01-05 1 986.906171 1100.197059
3 2015-01-06 1 984.473338 1100.197059
...
256 2015-12-25 1 997.965285 1145.215837
257 2015-12-28 1 929.652812 1145.215837
258 2015-12-29 1 1086.128017 1145.215837
259 2015-12-30 1 921.663949 1145.215837
260 2015-12-31 1 938.189566 1145.215837
[261 rows x 4 columns]
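One design note: grouping on dt.quarter alone works here because the data covers a single year; for multi-year data, the same transform idea can group on a quarterly period instead. A minimal sketch:

test_df['max'] = (
    test_df.groupby(test_df['Date'].dt.to_period('Q'))['Value'].transform('max')
)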