I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])["Value1","Value2"].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])["Value1","Value2"].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
For example, a new column named Value1_by_Date_lag1 and Value1_bySector_lag1.
And this new column will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
Basically in Value1_by_Date_lag1, the date "15/03" will contain the value "281.75" which is for the date "14/03" (lag of 1 shift).
Basically in Value1_bySector_lag1, the date "15/03" and Sector "Medical" will contain the value "275.0", which is the value for "14/03" and "Medical" rows.
I hope, the question is clear and gave you all the details.
Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date - if not you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.
I found 1 inefficient solution (slow and memory intensive).
Lag of "date" group
cols = ["Value1_byDate","Value2_byDate"]
temp = df[["date"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp.date = temp.date.shift(-1-i)
df = pd.merge(df,temp,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of "date" and "Sector" group
cols = ["Value1_bySector","Value2_bySector"]
temp = df[["date","Sector"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp[["Value1_bySector","Value2_bySector"]] = temp.groupby("Sector")["Value1_bySector","Value2_bySector"].shift(1+1)
df = pd.merge(df,temp,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a more simple solution?
I have df with the endDate column, which I want to use as my x array and discountFactor column as y array. The goal is to interpolate discountFactor at some date within the range of x array, let's say 2020-12-31 using cubic spline method.
The source df below:
lst = [['6M', '2020-10-26', '2021-04-26', '-0.521684','1.002611'], ['1Y', '2020-10-26', '2021-10-26', '-0.534855','1.005377'],
['5Y', '2020-10-26', '2025-10-27','-0.495927','1.025184']]
df = pd.DataFrame(lst, columns =['tenor', 'startDate', 'endDate','ratePercent','discountFactor'], dtype = float)
Please see the following steps, which unfortunately lead to df with all NaN columns, even there where I had known discountFacor values
# Converting date column strings to date objects
df['endDate'] = pd.to_datetime(df['endDate'], format='%Y-%m-%d')
df['startDate'] = pd.to_datetime(df['startDate'], format='%Y-%m-%d')
# Setting endDate column as my index
df.set_index("endDate", inplace= True)
# Creating daily dates range between start and end dates from the df
dates_range = pd.date_range(start='2020-10-26', end='2025-10-27')
idx = pd.DatetimeIndex(dates_range)
df = df.reindex(idx)
I currently have a df in pandas with a variable called 'Dates' that records the data an complaint was filed.
data = pd.read_csv("filename.csv")
Dates
Initially Received
07-MAR-08
08-APR-08
19-MAY-08
As you can see there are missing dates between when complaints are filed, also multiple complaints may have been filed on the same day. Is there a way to fill in the missing days while keeping complaints that were filed on the same day the same?
I tried creating a new df with datetime and merging the dataframes together,
days = pd.date_range(start='01-JAN-2008', end='31-DEC-2017')
df = pd.DataFrame(data=days)
df.index = range(3653)
dates = pd.merge(days, data['Dates'], how='inner')
but I get the following error:
ValueError: can not merge DataFrame with instance of type <class
'pandas.tseries.index.DatetimeIndex'>
Here are the first four rows of data
You were close, there's an issue with your input
First do:
df = pd.read_csv('filename.csv', skiprows = 1)
Then
days = pd.date_range(start='01-JAN-2008', end='31-DEC-2017')
df_clean = df.reset_index()
df_clean['idx dates'] = pd.to_datetime(df_clean['Initially Received'])
df2 = pd.DataFrame(data=days, index = range(3653), columns=['full dates'])
dates = pd.merge(df2, df_clean, left_on='full dates', right_on = 'idx dates', how='left')
Create your date range, and use merge to outer join it to the original dataframe, preserving duplicates.
import pandas as pd
from io import StringIO
TESTDATA = StringIO(
"""Dates;fruit
05-APR-08;apple
08-APR-08;banana
08-APR-08;pear
11-APR-08;grapefruit
""")
df = pd.read_csv(TESTDATA, sep=';', parse_dates=['Dates'])
dates = pd.date_range(start='04-APR-2008', end='12-APR-2008').to_frame()
pd.merge(
df, dates, left_on='Dates', right_on=0,
how='outer').sort_values(by=['Dates']).drop(columns=0)
# Dates fruit
# 2008-04-04 NaN
# 2008-04-05 apple
# 2008-04-06 NaN
# 2008-04-07 NaN
# 2008-04-08 banana
# 2008-04-08 pear
# 2008-04-09 NaN
# 2008-04-10 NaN
# 2008-04-11 grapefruit
# 2008-04-12 NaN
I have data like this:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
I would like to create a third column which contains a date range created by pd.date_range, using 'date' as the start date and 'n' as the number of periods.
So the first entry should be:
pd.date_range(dt.datetime(2018,8,25), periods=10, freq='d')
(I have a list of "target" dates, and my goal is to check whether the date_range contains any of those target dates).
I tried this:
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'],
x['n'],
freq='d'))
But this gives a KeyError: ('date', 'occurred at index date')
Any idea on how to do this without using a for loop, or is there a better solution altogether?
You can solve your problem without creating date range or day columns. To check if a target date in tgt belongs to a date range specified by rows of df, you can calculate the end of date range, and then check if each date in tgt falls in between the start and end of a time interval. The code below implements this, and produces "target_date" column identical to the one in your own answer:
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
df["daterange_end"] = df.apply(lambda x: x["date"] + pd.Timedelta(days=x["n"]), axis=1)
tgt = [dt.datetime(2018,8,26)]
df['target_date'] = 0
df.loc[(tgt[0] > df.date) &(tgt[0] < df.daterange_end),"target_date"] = 1
print(df)
# date n daterange_end target_date
# 0 2018-08-25 10 2018-09-04 1
# 1 2018-07-21 7 2018-07-28 0
You should add axis=1 in apply
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'], x['n'], freq='d'), axis=1)
I came up with a solution that works (but I'm sure there's a nicer way...)
# define target
tgt = [dt.datetime(2018,8,26)]
# find max n
max_n = max(df['n'])
# create that many columns and increment the day
for i in range(max_n):
df['date_{}'.format(i)] = df['date'] + dt.timedelta(days=i)
new_cols = ['date_{}'.format(n) for n in range(max_n)]
# check each one and replace with a 1 if it matches the "tgt"
df['target_date'] = 0
for col in new_cols:
df['target_date'] = np.where(df[col].isin(tgt),
1,
df['target_date'])
# drop intermediate cols
df = df[[i for i in df.columns if not i in new_cols]]