I have a calendar dataframe as follows.
calendar = pd.DataFrame({"events": ["e1", "e2", "e3"],
                         "date_start": ["2021-02-01", "2021-02-06", "2021-02-03"],
                         "date_end": ["2021-02-04", "2021-02-07", "2021-02-03"],
                         "country": ["us", "us", "uk"]})
calendar["date_start"] = pd.to_datetime(calendar["date_start"])
calendar["date_end"] = pd.to_datetime(calendar["date_end"])
and I have a daily dataframe as follows.
daily = pd.DataFrame({"date": pd.date_range(start="2021-02-01", end="2021-02-08"),
                      "value": [10, 20, 30, 40, 50, 60, 70, 80]})
I would like to take only the US events and join them to the daily dataframe, where the join conditions are (date >= date_start) and (date <= date_end). The expected output looks like this:
date value events
2021-02-01 10 e1
2021-02-02 20 e1
2021-02-03 30 e1
2021-02-04 40 e1
2021-02-05 50
2021-02-06 60 e2
2021-02-07 70 e2
2021-02-08 80
I can do this with a loop, but that is not efficient. Could you suggest a better way to do it?
Use df.merge:
# Do a cross-join on the `tmp` column
In [2279]: x = calendar.assign(tmp=1).merge(daily.assign(tmp=1))
# Filter rows by providing your conditions
In [2284]: x = x[x.date.between(x.date_start, x.date_end) & x.country.eq('us')]
# Left-join with `daily` df to get all rows
In [2289]: ans = daily.merge(x[['date', 'events']], on='date', how='left')
In [2290]: ans
Out[2290]:
date value events
0 2021-02-01 10 e1
1 2021-02-02 20 e1
2 2021-02-03 30 e1
3 2021-02-04 40 e1
4 2021-02-05 50 NaN
5 2021-02-06 60 e2
6 2021-02-07 70 e2
7 2021-02-08 80 NaN
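If your pandas version is 1.2 or newer, the temporary tmp column is not needed: merge supports how='cross' directly. A minimal sketch of the same approach, assuming the calendar and daily frames from the question:
# Cross-join directly (pandas >= 1.2), then filter and left-join back onto daily
x = calendar.merge(daily, how='cross')
x = x[x.date.between(x.date_start, x.date_end) & x.country.eq('us')]
ans = daily.merge(x[['date', 'events']], on='date', how='left')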
Here is a possible answer to your question.
import pandas as pd
data_temp_1 = pd.merge(daily, calendar, how='cross')
data_temp_2 = data_temp_1.query('country == "us"')
indices = (data_temp_2['date'] >= data_temp_2['date_start']) & (data_temp_2['date'] <= data_temp_2['date_end'])
final_df = data_temp_2[indices]
final_df.reset_index(drop=True, inplace=True)
To get the expected dataframe we can use:
expected_df = pd.merge(daily,final_df,how='left')[['date','value','events']]
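Since final_df shares both date and value with daily, the merge above joins on both columns implicitly; being explicit about the join key is a touch safer and gives the same result:
# Same left join, but merging on 'date' only
expected_df = pd.merge(daily, final_df[['date', 'events']], on='date', how='left')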
You can first explode the calendar and then merge on days:
calendar['date'] = [pd.date_range(s, e, freq='d') for s, e in
                    zip(calendar['date_start'], calendar['date_end'])]
calendar = calendar.explode('date').drop(['date_start', 'date_end'], axis=1)
events = calendar.merge(daily, how='inner', on='date')
us_events = events[events.country == 'us'].drop('country', axis=1)[['date', 'value', 'events']]
I think it is faster than the other answers provided (no apply).
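Note that the inner merge keeps only the days that fall inside a US event. If you also want the non-event days from daily, as in the expected output in the question, a left join back onto daily can be added; a minimal sketch using the us_events frame built above:
# Keep every daily row; days without a US event get NaN in `events`
ans = daily.merge(us_events[['date', 'events']], on='date', how='left')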
One option for a non-equi join is conditional_join from pyjanitor; under the hood it uses binary search to avoid a cartesian product, which can be helpful depending on the data size:
# pip install pyjanitor
import janitor
import pandas as pd
(
    daily
    .conditional_join(
        calendar,
        ("date", "date_start", ">="),
        ("date", "date_end", "<="),
        how="left")
    .loc[:, ['date', 'value', 'events']]
)
date value events
0 2021-02-01 10 e1
1 2021-02-02 20 e1
2 2021-02-03 30 e1
3 2021-02-03 30 e3
4 2021-02-04 40 e1
5 2021-02-05 50 NaN
6 2021-02-06 60 e2
7 2021-02-07 70 e2
8 2021-02-08 80 NaN
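The output above still contains the uk event e3 on 2021-02-03, because calendar was not filtered by country. To match the expected output in the question, the right frame can be restricted to the US rows first; the rest of the call is unchanged:
# Filter the calendar to US events before the non-equi join
us_calendar = calendar.loc[calendar.country.eq('us')]
(
    daily
    .conditional_join(
        us_calendar,
        ("date", "date_start", ">="),
        ("date", "date_end", "<="),
        how="left")
    .loc[:, ['date', 'value', 'events']]
)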
Existing dataframes:
df_1
Id dates time(sec)_1 time(sec)_2
1 02/02/2022 15 20
1 04/02/2022 20 30
1 03/02/2022 30 40
1 06/02/2022 50 40
2 10/02/2022 10 10
2 11/02/2022 15 20
df_2
Id min_date action_date
1 02/02/2022 04/02/2022
2 06/02/2022 10/02/2022
Expected Dataframe:
df_2
Id min_date action_date count_of_dates avg_time_1 avg_time_2
1 02/02/2022 04/02/2022 3 21.67 30
2 06/02/2022 10/02/2022 1 10 10
count_of_dates, avg_time_1 and avg_time_2 are to be created from df_1.
count_of_dates is calculated from min_date and action_date, i.e. the number of dates in df_1 that fall between min_date and action_date.
avg_time_1 and avg_time_2 are averaged over those same rows.
I am stuck on applying the condition for the dates :-( Any leads?
If the data is small, it is possible to filter per row with a custom function:
df_1['dates'] = df_1['dates'].apply(pd.to_datetime)
df_2[['min_date','action_date']] = df_2[['min_date','action_date']].apply(pd.to_datetime)
def f(x):
    m = df_1['Id'].eq(x['Id']) & df_1['dates'].between(x['min_date'], x['action_date'])
    s = df_1.loc[m, ['time(sec)_1','time(sec)_2']].mean()
    return pd.Series([m.sum()] + s.to_list(), index=['count_of_dates'] + s.index.tolist())
df = df_2.join(df_2.apply(f, axis=1))
print (df)
Id min_date action_date count_of_dates time(sec)_1 time(sec)_2
0 1 2022-02-02 2022-04-02 3.0 21.666667 30.0
1 2 2022-06-02 2022-10-02 1.0 10.000000 10.0
If Id in df_2 is unique, it is possible to improve performance by merging with df_1 and aggregating with size and mean:
df = df_2.merge(df_1, on='Id')
d = {'count_of_dates': ('Id', 'size'),
     'time(sec)_1': ('time(sec)_1', 'mean'),
     'time(sec)_2': ('time(sec)_2', 'mean')}
df = df_2.join(df[df['dates'].between(df['min_date'], df['action_date'])]
                 .groupby('Id').agg(**d), on='Id')
print (df)
Id min_date action_date count_of_dates time(sec)_1 time(sec)_2
0 1 2022-02-02 2022-04-02 3 21.666667 30
1 2 2022-06-02 2022-10-02 1 10.000000 10
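One caveat, assuming the dates in the question are day-first (dd/mm/yyyy): pd.to_datetime parses ambiguous dates month-first by default, which is why the output above shows 2022-04-02 for 04/02/2022. Passing dayfirst=True keeps the intended dates; a sketch of the conversion step only:
# Parse dd/mm/yyyy strings as day-first dates
df_1['dates'] = pd.to_datetime(df_1['dates'], dayfirst=True)
df_2[['min_date', 'action_date']] = df_2[['min_date', 'action_date']].apply(pd.to_datetime, dayfirst=True)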
I have a dataframe as shown below:
A B ans_0 ans_3 ans_4
timestamp
2022-05-09 09:28:00 0 45 20 200 100
2022-05-09 09:28:01 3 100 10 80 50
2022-05-09 09:28:02 4 30 30 60 10
In this dataframe, the values in column A appear as part of the column names. That is, the values 0, 3 and 4 of column A appear in the column names ans_0, ans_3 and ans_4.
My goal is, for each row, to compare the value in column A with the column names and, if it matches, take the value in that column and put it in column B.
The output should look as shown below:
A B ans_0 ans_3 ans_4
timestamp
2022-05-09 09:28:00 0 20 20 200 100
2022-05-09 09:28:01 3 80 10 80 50
2022-05-09 09:28:02 4 10 30 60 10
For example, in the first row the value 0 from column A matches the column ans_0. The value there, which is 20, is put in column B; column B had a value of 45, which is replaced by 20.
Is there an easier way to do this?
Thanks!
You need to use indexing lookup; for this you first need to ensure that the values in A match the column names (0 -> 'ans_0'):
import numpy as np
idx, cols = pd.factorize('ans_' + df['A'].astype(str))
df['B'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
output:
A B ans_0 ans_3 ans_4
timestamp
2022-05-09 09:28:00 0 20 20 200 100
2022-05-09 09:28:01 3 80 10 80 50
2022-05-09 09:28:02 4 10 30 60 10
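A closely related variant, if you prefer to skip factorize, is to translate the constructed labels into column positions with Index.get_indexer; a sketch assuming the same df as above:
import numpy as np
# Build the target column label for each row, then translate it to a column position
cols = 'ans_' + df['A'].astype(str)
col_idx = df.columns.get_indexer(cols)
# Row-wise lookup over the full array (mixed dtypes become an object array here)
df['B'] = df.to_numpy()[np.arange(len(df)), col_idx]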
You could reindex the ans columns with A column values; then get the values on the diagonal:
import numpy as np
df.columns = df.columns.str.split('_', expand=True)
df['B'] = np.diag(df['ans'].reindex(df['A'].squeeze().astype('string'), axis=1))
df.columns = [f"{i}_{j}" if j==j else i for i,j in df.columns]
Output:
A B ans_0 ans_3 ans_4
timestamp
2022-05-09 09:28:00 0 20 20 200 100
2022-05-09 09:28:01 3 80 10 80 50
2022-05-09 09:28:02 4 10 30 60 10
I have two dataframes and for one I want to find the closest (previous) date in the other.
If the date matches exactly, then I need to take the previous date.
df_main contains the reference information.
For df_sample I want to look up the Time in df_main for the closest (but previous) entry. I can do this using method='ffill', but where the date for the Time field is the same day it returns that day. I want it to return the previous one, basically a < rather than a <=.
In my example df_res I want the closest_val column to contain ["n/a", 90, 90, 280, 280, 280].
import pandas as pd
dsample = {'Index': [1, 2, 3, 4, 5, 6],
           'Time': ["2020-06-01", "2020-06-02", "2020-06-03", "2020-06-04", "2020-06-05", "2020-06-06"],
           'Pred': [100, -200, 300, -400, -500, 600]
           }
dmain = {'Index': [1, 2, 3],
         'Time': ["2020-06-01", "2020-06-03", "2020-06-06"],
         'Actual': [90, 280, 650]
         }
def find_closest(x, df2):
    df_res = df2.iloc[df2.index.get_loc(x['Time'], method='ffill')]
    x['closest_time'] = df_res['Time']
    x['closest_val'] = df_res['Actual']
    return x
df_sample = pd.DataFrame(data=dsample)
df_main = pd.DataFrame(data=dmain)
df_sample = df_sample.set_index(pd.DatetimeIndex(df_sample['Time']))
df_main = df_main.set_index(pd.DatetimeIndex(df_main['Time']))
df_res = df_sample.apply(find_closest, df2=df_main ,axis=1)
Use pd.merge_asof (make sure the 'Time' column in both frames is actually a datetime, e.g. via pd.to_datetime):
pd.merge_asof(df_sample, df_main, left_on="Time", right_on="Time", allow_exact_matches=False)
The output is:
Index_x Time Pred Index_y Actual
0 1 2020-06-01 100 NaN NaN
1 2 2020-06-02 -200 1.0 90.0
2 3 2020-06-03 300 1.0 90.0
3 4 2020-06-04 -400 2.0 280.0
4 5 2020-06-05 -500 2.0 280.0
5 6 2020-06-06 600 2.0 280.0
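If you want the column names from the question (closest_time and closest_val, with "n/a" in the first row), a copy of the right-hand Time can be kept before the merge. A sketch assuming the Time columns are already datetime as noted above; right and res are just illustrative names:
# Keep a copy of the right-hand Time so it survives the merge as closest_time
right = df_main.reset_index(drop=True).rename(columns={"Actual": "closest_val"}).drop(columns="Index")
right["closest_time"] = right["Time"]
res = pd.merge_asof(df_sample.reset_index(drop=True), right,
                    on="Time", allow_exact_matches=False)
res["closest_val"] = res["closest_val"].fillna("n/a")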
IIUC, we can do a Cartesian product of both your dataframes, then filter out the exact matches, then apply some logic to figure out the closest date.
Finally, we will join the exact and non-exact matches into a final dataframe.
import numpy as np
s = pd.merge(
    df_sample.assign(key="var1"),
    df_main.assign(key="var1").rename(columns={"Time": "TimeDelta"}).drop(columns="Index"),
    on="key",
    how="outer",
).drop(columns="key")
exact_matches = s[s['Time'].eq(s['TimeDelta'])]
non_exact_matches_cart = s[~s['Time'].isin(exact_matches['Time'])]
non_exact_matches = non_exact_matches_cart.assign(
    delta=(non_exact_matches_cart["Time"] - non_exact_matches_cart["TimeDelta"])
    / np.timedelta64(1, "D")
).query("delta >= 0").sort_values(["Time", "delta"]).drop_duplicates(
    "Time", keep="first"
).drop(columns='delta')
A lot to take in above, but essentially we are finding the difference in time, removing any difference that goes into the future, and dropping duplicates so that only the closest date in the past is kept.
df = pd.concat([exact_matches, non_exact_matches], axis=0).sort_values("Time").rename(
    columns={"TimeDelta": "closest_time", "Actual": "closest val"}
)
print(df)
Index Time Pred closest_time closest val
0 1 2020-06-01 100 2020-06-01 90
3 2 2020-06-02 -200 2020-06-01 90
7 3 2020-06-03 300 2020-06-03 280
10 4 2020-06-04 -400 2020-06-03 280
13 5 2020-06-05 -500 2020-06-03 280
17 6 2020-06-06 600 2020-06-06 650
I could use some more help with a project. I am trying to analyze 4.5 million rows of data. I have read the data into a dataframe and organized it, and now have 3 columns: 1) date (as datetime), 2) unique identifier, 3) price.
I need to calculate the year-over-year change in prices per item, but the dates are not uniform and not consistent per item. For example:
date item price
12/31/15 A 110
12/31/15 B 120
12/31/14 A 100
6/24/13 B 100
What I would like to find as a result is:
date item price previousdate % change
12/31/15 A 110 12/31/14 10%
12/31/15 B 120 6/24/13 20%
12/31/14 A 100
6/24/13 B 100
EDIT - Better example of data
date item price
6/1/2016 A 276.3457646
6/1/2016 B 5.044165645
4/27/2016 B 4.91300186
4/27/2016 A 276.4329163
4/20/2016 A 276.9991265
4/20/2016 B 4.801263717
4/13/2016 A 276.1950213
4/13/2016 B 5.582923328
4/6/2016 B 5.017863509
4/6/2016 A 276.218649
3/30/2016 B 4.64274783
3/30/2016 A 276.554653
3/23/2016 B 5.576438253
3/23/2016 A 276.3135836
3/16/2016 B 5.394435443
3/16/2016 A 276.4222986
3/9/2016 A 276.8929462
3/9/2016 B 4.999951262
3/2/2016 B 4.731349423
3/2/2016 A 276.3972068
1/27/2016 A 276.8458971
1/27/2016 B 4.993033132
1/20/2016 B 5.250379701
1/20/2016 A 276.2899864
1/13/2016 B 5.146639666
1/13/2016 A 276.7041978
1/6/2016 B 5.328296958
1/6/2016 A 276.9465891
12/30/2015 B 5.312301356
12/30/2015 A 256.259668
12/23/2015 B 5.279105491
12/23/2015 A 255.8411198
12/16/2015 B 5.150798234
12/16/2015 A 255.8360529
12/9/2015 A 255.4915183
12/9/2015 B 4.722876886
12/2/2015 A 256.267146
12/2/2015 B 5.083626167
10/28/2015 B 4.876177757
10/28/2015 A 255.6464653
10/21/2015 B 4.551439655
10/21/2015 A 256.1735769
10/14/2015 A 255.9752668
10/14/2015 B 4.693967392
10/7/2015 B 4.911797443
10/7/2015 A 256.2556707
9/30/2015 B 4.262994526
9/30/2015 A 255.8068691
7/1/2015 A 255.7312385
4/22/2015 A 234.6210132
4/15/2015 A 235.3902076
4/15/2015 B 4.154926102
4/1/2015 A 234.4713827
2/25/2015 A 235.1391496
2/18/2015 A 235.1223471
What I have done (with some help from other users) hasn't worked, but it is below. Thanks for any help you can provide or for pointing me in the right direction!
import pandas as pd
import datetime as dt
import numpy as np
df = pd.read_csv('...python test file5.csv',parse_dates =['As of Date'])
df = df[['item','price','As of Date']]
def get_prev_year_price(x, df):
    try:
        return df.loc[x['prev_year_date'], 'price']
        #return np.abs(df.time - x)
    except Exception as e:
        return x['price']
#Function to determine the closest date from given date and list of all dates
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))
df['As of Date'] = pd.to_datetime(df['As of Date'],format='%m/%d/%Y')
df = df.rename(columns = {df.columns[2]:'date'})
# list of dates
dtlst = [item for item in df['date']]
data = []
data2 = []
for item in df['item'].unique():
    item_df = df[df['item'] == item] #select based on items
    select_dates = item_df['date'].unique()
    item_df.set_index('date', inplace=True) #set date as key index
    item_df = item_df.resample('D').mean().reset_index() #fill in missing date
    item_df['price'] = item_df['price'].interpolate('nearest') #fill in price with nearest price available
    # use max(item_df['date'] where item_df['date'] < item_df['date'] - pd.DateOffset(years=1, days=1))
    #possible_date = item_df['date'] - pd.DateOffset(years=1)
    #item_df['prev_year_date'] = max(df[df['date'] <= possible_date])
    item_df['prev_year_date'] = item_df['date'] - pd.DateOffset(years=1) #calculate 1 year ago date
    date_df = item_df[item_df.date.isin(select_dates)] #select dates with useful data
    item_df.set_index('date', inplace=True)
    date_df['prev_year_price'] = date_df.apply(lambda x: get_prev_year_price(x, item_df),axis=1)
    #date_df['prev_year_price'] = date_df.apply(lambda x: nearest(dtlst, x),axis=1)
    date_df['change'] = date_df['price'] / date_df['prev_year_price']-1
    date_df['item'] = item
    data.append(date_df)
    data2.append(item_df)
summary = pd.concat(data).sort_values('date', ascending=False)
#print (summary)
#saving the output of the CSV file to see how data looks after being handled
filename = '...python_test_file_save4.csv'
summary.to_csv(filename, index=True, encoding='utf-8')
With the current assumptions, this works for this specific use case.
In [2459]: def change(grp):
      ...:     grp['% change'] = grp.price.pct_change() * 100
      ...:     grp['previousdate'] = grp.date.shift(1)
      ...:     return grp
Sort on date then groupby and apply the change function, then sort the index back.
In [2460]: df.sort_values('date').groupby('item').apply(change).sort_index()
Out[2460]:
date item price % change previousdate
0 2015-12-31 A 110 10.0 2014-12-31
1 2015-12-31 B 120 20.0 2013-06-24
2 2014-12-31 A 100 NaN NaT
3 2013-06-24 B 100 NaN NaT
This is a good situation for merge_asof, which merges two dataframes by finding, for each row of the left dataframe, the last row of the right dataframe whose key is less than or equal to the left key. We need to add a year to the right dataframe first, since the requirement is a difference of one year or more between dates.
Here is some sample data that you brought up in your comment.
date item price
12/31/15 A 110
12/31/15 B 120
12/31/14 A 100
6/24/13 B 100
12/31/15 C 100
1/31/15 C 80
11/14/14 C 130
11/19/13 C 110
11/14/13 C 200
The dates need to be sorted for merge_asof to work. merge_asof also keeps only the left frame's joining column, so we need to put a copy of the right frame's date back in as previousdate.
Setup dataframes
df = df.sort_values('date')
df_copy = df.copy()
df_copy['previousdate'] = df_copy['date']
df_copy['date'] += pd.DateOffset(years=1)
Use merge_asof
df_final = pd.merge_asof(df, df_copy,
                         on='date',
                         by='item',
                         suffixes=['current', 'previous'])
df_final['% change'] = (df_final['pricecurrent'] - df_final['priceprevious']) / df_final['priceprevious']
df_final
date item pricecurrent priceprevious previousdate % change
0 2013-06-24 B 100 NaN NaT NaN
1 2013-11-14 C 200 NaN NaT NaN
2 2013-11-19 C 110 NaN NaT NaN
3 2014-11-14 C 130 200.0 2013-11-14 -0.350000
4 2014-12-31 A 100 NaN NaT NaN
5 2015-01-31 C 80 110.0 2013-11-19 -0.272727
6 2015-12-31 A 110 100.0 2014-12-31 0.100000
7 2015-12-31 B 120 100.0 2013-06-24 0.200000
8 2015-12-31 C 100 130.0 2014-11-14 -0.230769
I have a dataframe with the biweekly data below:
date value
15-06-2012 20
30-06-2012 30
And I need to join it with another dataframe that has the data below:
date cost
2-05-2011 5
3-04-2012 80
2-06-2012 10
3-06-2012 10
4-06-2012 30
5-06-2012 20
10-06-2012 10
15-06-2012 10
18-06-2012 30
20-06-2012 20
21-06-2012 30
22-06-2012 30
29-06-2012 20
29-10-2012 30
I need to join the 2 dataframes in such a way that, from the other dataframe, I get the average cost between 1 and 15 June 2012 to fill the cost for 15-06-2012, and similarly for the 30-06-2012 cost I get the average between 16-06-2012 and 30-06-2012, giving the results below:
date value cost
15-06-2012 20 15 which is (10+10+30+20+10+10)/6
30-06-2012 30 26 which is (30+20+30+30+20)/5
Convert your date columns to datetime, then use merge_asof:
#df.date=pd.to_datetime(df.date,dayfirst=True)
#df1.date=pd.to_datetime(df1.date,dayfirst=True)
df['keepkey']=df.date
mergedf=pd.merge_asof(df1,df,on='date',direction ='forward')
mergedf.groupby('keepkey',as_index=False).mean()
Out[373]:
keepkey cost value
0 2012-06-15 15 20
1 2012-06-30 26 30
Update :
df['keepkey']=df.date
df['key']=df.date.dt.strftime('%Y-%m')
df1['key']=df1.date.dt.strftime('%Y-%m')
mergedf=pd.merge_asof(df1,df,on='date',by='key',direction ='forward')
mergedf.groupby('keepkey',as_index=False).mean()
Out[417]:
keepkey cost key value
0 2012-06-15 15 6 20.0
1 2012-06-30 26 6 30.0
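On pandas 2.0 and newer, groupby(...).mean() no longer silently drops non-numeric columns, so the string key column (or the datetime columns) can either raise an error or show up in the result. Selecting the columns to average explicitly keeps the output the same; a small sketch assuming the mergedf built above:
# Average only the numeric columns per biweekly bucket
mergedf.groupby('keepkey', as_index=False)[['cost', 'value']].mean()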
This would need a merge followed by a groupby:
m = df.merge(df2, on='date', how='outer')
m['date'] = pd.to_datetime(m.date, dayfirst=True)
m = m.sort_values('date')
(m.groupby(m['value'].notnull().shift().fillna(False).cumsum(),
as_index=False)
.agg({'date' : 'last', 'cost' : 'mean', 'value' : 'last'}))
date cost value
0 2012-06-15 15.0 20.0
1 2012-06-30 26.0 30.0