I have data like this:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date': [dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
                   'n': [10, 7]})
I would like to create a third column which contains a date range created by pd.date_range, using 'date' as the start date and 'n' as the number of periods.
So the first entry should be:
pd.date_range(dt.datetime(2018,8,25), periods=10, freq='d')
(I have a list of "target" dates, and my goal is to check whether the date_range contains any of those target dates).
I tried this:
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'],
                                                    x['n'],
                                                    freq='d'))
But this gives a KeyError: ('date', 'occurred at index date')
Any idea on how to do this without using a for loop, or is there a better solution altogether?
You can solve this without creating a date-range or per-day column. To check whether a target date in tgt falls in the date range specified by a row of df, compute the end of the range and test whether each target lies between the start and the end. The code below does this and produces a "target_date" column identical to the one in your own answer:
df = pd.DataFrame({'date': [dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
                   'n': [10, 7]})
df["daterange_end"] = df.apply(lambda x: x["date"] + pd.Timedelta(days=x["n"]), axis=1)
tgt = [dt.datetime(2018,8,26)]
df['target_date'] = 0
df.loc[(tgt[0] > df.date) &(tgt[0] < df.daterange_end),"target_date"] = 1
print(df)
# date n daterange_end target_date
# 0 2018-08-25 10 2018-09-04 1
# 1 2018-07-21 7 2018-07-28 0
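If you have several target dates, the same interval check can be broadcast over all of them at once; a minimal sketch (the second target below is an invented example):
import numpy as np
tgt = pd.to_datetime([dt.datetime(2018,8,26), dt.datetime(2018,7,30)])
# (len(df), len(tgt)) boolean matrix: True where a target falls inside a row's range
hits = (tgt.to_numpy() >= df['date'].to_numpy()[:, None]) & \
       (tgt.to_numpy() <  df['daterange_end'].to_numpy()[:, None])
df['target_date'] = hits.any(axis=1).astype(int)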
You should add axis=1 to apply so the lambda receives each row, and pass n via the periods keyword (as the second positional argument it would be interpreted as the end date):
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'], periods=x['n'], freq='d'), axis=1)
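Once the date_range column exists, the target check from the question can be done per row, e.g. (a sketch reusing the tgt list defined above):
df['target_date'] = df['date_range'].apply(lambda rng: int(rng.isin(tgt).any()))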
I came up with a solution that works (but I'm sure there's a nicer way...)
import numpy as np

# define target
tgt = [dt.datetime(2018,8,26)]
# find max n
max_n = max(df['n'])
# create that many columns and increment the day
for i in range(max_n):
    df['date_{}'.format(i)] = df['date'] + dt.timedelta(days=i)
new_cols = ['date_{}'.format(n) for n in range(max_n)]
# check each one and replace with a 1 if it matches the "tgt"
df['target_date'] = 0
for col in new_cols:
    df['target_date'] = np.where(df[col].isin(tgt), 1, df['target_date'])
# drop intermediate cols
df = df[[i for i in df.columns if i not in new_cols]]
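A more compact version of the same idea (still row-wise, so not vectorized) skips the intermediate columns entirely; a sketch using date_range directly:
df['target_date'] = df.apply(
    lambda r: int(pd.date_range(r['date'], periods=r['n'], freq='d').isin(tgt).any()),
    axis=1)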
I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, sectorGroup, on=["date","Sector"], how="left", suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on=["date"], how="left", suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now I want to create lag features for Value1_bySector, Value2_bySector, Value1_byDate, and Value2_byDate.
For example, new columns named Value1_by_Date_lag1 and Value1_bySector_lag1.
These new columns will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
In Value1_by_Date_lag1, the date "15/03" carries the value "281.75", which is the value for the date "14/03" (a lag of 1 shift).
In Value1_bySector_lag1, the rows for date "15/03" and Sector "Medical" carry the value "275.0", which is the value for the "14/03" Medical rows.
I hope the question is clear and gives you all the details.
Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date; if not, you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.
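If you want the exact column names from your question, you can rename the suffixed columns produced by the merges afterwards, e.g.:
df = df.rename(columns={"Value1_by_sector_lag": "Value1_bySector_lag1",
                        "Value2_by_sector_lag": "Value2_bySector_lag1",
                        "Value1_by_date_lag": "Value1_by_Date_lag1",
                        "Value2_by_date_lag": "Value2_by_Date_lag1"})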
I found an inefficient solution (slow and memory-intensive).
Lag of "date" group
cols = ["Value1_byDate","Value2_byDate"]
temp = df[["date"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp.date = temp.date.shift(-1-i)
df = pd.merge(df,temp,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of "date" and "Sector" group
cols = ["Value1_bySector","Value2_bySector"]
temp = df[["date","Sector"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp[["Value1_bySector","Value2_bySector"]] = temp.groupby("Sector")["Value1_bySector","Value2_bySector"].shift(1+1)
df = pd.merge(df,temp,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a more simple solution?
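A possibly simpler sketch, starting from the original df (date, Sector, Value1, Value2) and assuming the dates sort correctly: shift the per-group means directly instead of merging on shifted keys (shown for lag 1 only; loop over the shift amount for more lags):
by_date = df.groupby("date")[["Value1", "Value2"]].mean()
df = df.join(by_date.shift(1), on="date", rsuffix="_byDate_lag1")

by_sector = df.groupby(["date", "Sector"])[["Value1", "Value2"]].mean()
df = df.join(by_sector.groupby(level="Sector").shift(1),
             on=["date", "Sector"], rsuffix="_bySector_lag1")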
I am trying to write a script that will add 4 months or 8 months to the column titled "Date" depending on the column titled "Quarterly_Call__c". For instance, if the value in Quarterly_Call__c = 2 then add 4 months to the "Date" column and if the value is 3, add 8 months. Finally, I want the output in the column titled "New Date".
So far I am able to add the number of months I want using this piece of code:
from datetime import date
from dateutil.relativedelta import relativedelta

new_date = []
df['Date'] = df['Date'].dt.normalize()
for value in df['Date']:
    new_date.append(value + relativedelta(months=+4))
df['New Date'] = new_date
However, as I mentioned, I would like this to work depending on the value in Quarterly_Call__c, so I tried writing this code:
for i in df['Quarterly_Call__c'].astype(int).to_list():
    if i == 2:
        for value in df['Date']:
            new_date.append(value + relativedelta(months=+4))
    elif i == 3:
        for value in df['Date']:
            new_date.append(value + relativedelta(months=+8))
Unfortunately, this does not work. Could you please recommend a solution? Thanks!
Applying a lambda expression to each of the rows of your DataFrame seems to be the most convenient approach:
def date_calc(q, d):
    if q == 2:
        return d + relativedelta(months=+4)
    else:
        return d + relativedelta(months=+8)

df['New Date'] = df.apply(lambda x: date_calc(x['Quarterly_Call__c'], x['Date']), axis=1)
The date_calc function holds the same logic you posted in your question while taking the inputs as arguments, and the DataFrame's apply method is used to calculate the 'New Date' column for each row; the variable x of the lambda expression represents a row of the DataFrame.
Keep in mind that setting the axis argument to 1 is what makes the function apply to each row of the DataFrame rather than each column. More info about the apply method can be found in the pandas documentation.
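Alternatively, a vectorized sketch (assuming Quarterly_Call__c only ever takes the values 2 and 3) maps each code straight to a pd.DateOffset; pandas applies the offsets element-wise and may emit a PerformanceWarning:
offsets = {2: pd.DateOffset(months=4), 3: pd.DateOffset(months=8)}
df['New Date'] = df['Date'] + df['Quarterly_Call__c'].astype(int).map(offsets)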
You could iterate row by row to access each row's data and calculate the new date.
import pandas as pd
from dateutil.relativedelta import relativedelta

df = pd.DataFrame({
    'Quarterly_Call__c': [2, 3, 2, 3],
    'Date': ['2021-02-25', '2021-03-25', '2021-04-25', '2021-05-25']
})
df['Date'] = pd.to_datetime(df['Date'])
df['New Date'] = pd.NaT  # new empty column (NaT keeps the dtype datetime-like so .dt works below)
for i in range(len(df)):
    if df.loc[i, 'Quarterly_Call__c'] == 2:
        df.loc[i, 'New Date'] = df.loc[i, 'Date'] + relativedelta(months=+4)
    if df.loc[i, 'Quarterly_Call__c'] == 3:
        df.loc[i, 'New Date'] = df.loc[i, 'Date'] + relativedelta(months=+8)
df['New Date'] = df['New Date'].dt.normalize()
Output
Quarterly_Call__c Date New Date
0 2 2021-02-25 2021-06-25
1 3 2021-03-25 2021-11-25
2 2 2021-04-25 2021-08-25
3 3 2021-05-25 2022-01-25
You can try lambda functions on your dataframe. For example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(numbers,columns=['set_of_numbers'])
df['equal_or_lower_than_4?'] = df['set_of_numbers'].apply(lambda x: 'True' if x <= 4 else 'False')
print (df)
You can check the pandas documentation for more information on how to apply if conditions to a Pandas DataFrame.
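For comparison, the same column can be built without apply, using numpy's vectorized where:
import numpy as np
df['equal_or_lower_than_4?'] = np.where(df['set_of_numbers'] <= 4, 'True', 'False')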
I have 2 dfs:
dfMiss:
and
dfSuper:
I need to create a final output that summarises the data in the 2 tables, which I am able to do as shown in the code below:
dfCity = dfSuper \
.groupby(by='City').count() \
.drop(columns='Superhero ID') \
.rename(columns={'Superhero': 'Total count'})
print("This is the df city : ")
print(dfCity)
## Convert column MissionEndDate to DateTime format
for df in dfMiss:
    # Dates are interpreted as MM/dd/yyyy by default, dayfirst=False
    df['Mission End date'] = pd.to_datetime(df['Mission End date'], dayfirst=True)
    # Get Year and Quarter, given Q1 2020 starts in April
    date = df['Mission End date'] - pd.DateOffset(months=3)
    df['Mission End quarter'] = date.dt.year.astype(str) + ' Q' + date.dt.quarter.astype(str)
## Get no. Superheros working per City per Quarter
dfCount = []
for dfM in dfMiss:
    # Merge DataFrames
    df = dfSuper.merge(dfM, left_on='Superhero ID', right_on='SID')
    df = df.pivot_table(index=['City', 'Superhero'], columns='Mission End quarter', aggfunc='nunique')
    # Get the first group (all the groups have the same values)
    df = df[df.columns[0][0]]
    # Group the values by City (effectively "collapsing" the 'Superhero' column)
    df = df.groupby(by=['City']).count()
    dfCount += [df]
## Get no. Superheros available per City per Quarter
dfFree = []
for dfC in dfCount:
    # Merge DataFrames
    df = dfCity.merge(right=dfC, on='City', how='outer').fillna(0)  # convert NaN values to 0
    # Subtract no. working superheros from total no. superheros per city
    for col in df.columns[1:]:
        df[col] = df['Total count'] - df[col]
    dfFree += [df.astype(int)]
print(dfFree)
dfResult = pd.DataFrame(dfFree)
The problem is when I try to convert DfFree into a dataframe I get the error:
"ValueError: Must pass 2-d input. shape=(1, 4, 5) "
The line that raises the error is
dfResult = pd.DataFrame(dfFree)
Anyone have any idea what this means and how I can convert the list into a df?
Thanks :)
Separate your code into smaller, single-purpose pieces (SOLID: separation of concerns); as posted it is not easy to read.
import pandas as pd

sid=[665544,665544,2121,665544,212121,123456,666666]
mission_end_date=["10/10/2020", "03/03/2021", "02/02/2021", "05/12/2020", "15/07/2021", "03/06/2021", "12/10/2020"]
superherod_sid=[212121,364331,678523,432432,665544,123456,555555,666666,432432]
hero=["Spiderman","Ironman","Batman","Dr. Strange","Thor","Superman","Nightwing","Loki","Wolverine"]
city=["New York","New York","Gotham","New York","Asgard","Metropolis","Gotham","Asgard","New York"]
df_mission=pd.DataFrame({'sid':sid,'mission_end_date':mission_end_date})
df_super=pd.DataFrame({'sid':superherod_sid,'hero':hero, 'city':city})
df=df_super.merge(df_mission,on="sid", how="left")
df['mission_end_date']=pd.to_datetime(df['mission_end_date'])
df['mission_end_date_quarter']=df['mission_end_date'].dt.quarter
df['mission_end_date_year']=df['mission_end_date'].dt.year
print(df.head(20))
pivot = df.pivot_table(index=['city', 'hero'], columns='mission_end_date_quarter', aggfunc='nunique').fillna(0)
print(pivot.head())
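As for the ValueError itself: pd.DataFrame expects 2-d input, but dfFree is a list of DataFrames, which numpy stacks into a 3-d block (the shape=(1, 4, 5) in the message is 1 frame of 4 rows by 5 columns). Concatenating the list instead gives a single frame:
dfResult = pd.concat(dfFree)  # or dfFree[0] if the list holds a single frame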
I have DataFrame like below:
df = pd.DataFrame({"data" : ["25.01.2020", and many more other dates...]})
df["data"] = pd.to_datetime(df["data"], format = "%d%m%Y")
And I have a series of special dates like below:
special_date = pd.Series(pd.to_datetime(["16.01.2020",
"27.01.2020",
and many more other dates...], dayfirst=True))
And I need to calculate 2 more columns in this DataFrame:
col1 = number of weeks to the next special date
col2 = number of weeks after the last special date
So I need results like below:
col1 = 1, because the next special date after 25.01 is 27.01, so it is the same week
col2 = 2, because the last special date before 25.01 is 16.01, so it is 2 weeks ago
*Please be aware that I have many more dates, so the code needs to work for more than only 2 special dates or only 1 date in df.
You can use broadcasting to create a matrix of time deltas and then calculate the minima for your new columns:
import numpy as np, pandas as pd

df = pd.DataFrame({'data': pd.to_datetime(["01.01.2020","25.01.2020","20.02.2020"], dayfirst=True)})
s = pd.Series(pd.to_datetime(["16.01.2020","27.01.2020","08.02.2020","19.02.2020"], dayfirst=True))

# (len(s), len(df)) matrix of signed differences in days
delta = (s.to_numpy()[:,None] - df['data'].to_numpy()).astype('timedelta64[D]') / np.timedelta64(1, 'D')
# smallest positive delta per column = days until the next special date
n = np.min( delta, 0, where=delta> 0, initial=np.inf)
# smallest non-negative -delta per column = days since the last special date
p = np.min(-delta, 0, where=delta<=0, initial=np.inf)
df['next'] = np.ceil(n/7)  # consider np.floor
df['prev'] = np.ceil(p/7)
Alternatively to using the where argument you could perform the steps by hand:
n = delta.copy(); n[delta<=0] = np.inf; n = np.abs(np.min(n,0))
p = delta.copy(); p[delta> 0] = -np.inf; p = np.abs(np.min(-p,0))
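An alternative sketch uses np.searchsorted instead of the full delta matrix; it assumes the special dates in s are sorted ascending, and rows with no special date on one side would need extra guarding:
sv = s.to_numpy()
dv = df['data'].to_numpy()
idx = np.searchsorted(sv, dv, side='right')  # index of the first special date strictly after each date
days_next = (sv[np.minimum(idx, len(sv)-1)] - dv) / np.timedelta64(1, 'D')
days_prev = (dv - sv[np.maximum(idx-1, 0)]) / np.timedelta64(1, 'D')
df['next'] = np.ceil(days_next/7)
df['prev'] = np.ceil(days_prev/7)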
Happy 2020! I would like to create a dataframe based on two others. I have the below two dataframes:
df1 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'],'A': [63.63,64.08,64.19,65.11,65.36,65.25,65.36], 'B': [63.83, 64.10, 64.19, 65.08, 65.33, 65.28, 65.36], 'C':[63.99, 64.22, 64.30, 65.16, 65.41, 65.36, 65.44]})
df2 = pd.DataFrame({'Name':['A','B','C'],'Notice': ['05.05.1982','07.05.1982','12.05.1982']})
The idea is to create df3 such that it takes the values of A until A's notice date (found in df2) is reached, then switches to the values of B until B's notice date is reached, and so on. On a notice date itself, it should take the mean of the current column and the next one.
In the above example, df3 should be as follows (with formulas to illustrate):
df3 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'], 'Result':[63.63,64.08,(64.19+64.19)/2,65.08,(65.33+65.41)/2,65.36,65.44]})
My idea was to first create a temporary dataframe with same dimensions as df1 and to fill it with 1's when the index date is prior to notice and 0's after. Doing a rolling mean with window 1 would give for each column a series of 1 until I reach 0.5 (signalling a switch).
Not sure if there is a better way to get df3?
I tried the following:
def fill_rule(df_p, df_t):
    return np.where(df_p.index > df_t[df_t.Name==df_p.name]['Notice'][0], 0, 1)

df1['date'] = pd.to_datetime(df1['date'])
df2['Notice'] = pd.to_datetime(df2['Notice'])
df1.set_index("date", inplace = True)
temp = df1.apply(lambda x: fill_rule(x, df2), axis = 0)
And I got the following error: KeyError: (0, 'occurred at index B')
# label each date with the name of the column whose notice date it is
df1['t'] = df1['date'].map(df2.set_index(["Notice"])['Name'])
# backfill so every date carries the column name of the next upcoming notice
df1['t'] = df1['t'].fillna(method='bfill').fillna("C")
df3 = pd.DataFrame()
# pick, for each row, the value from the column named in 't'
df3['Result'] = df1.apply(lambda row: row[row['t']], axis=1)
df3['date'] = df1['date']
You can use the between method to select the specific date ranges in both dataframes and then use loc to add the corresponding values:
# Initializing the output (assumes df1['date'] and df2['Notice'] have been
# converted with pd.to_datetime)
df3 = df1.copy()
df3.drop(['B','C'], axis=1, inplace=True)
df3.columns = ['date','Result']
df3['Result'] = 0.0
df3['count'] = 0

# Modifying df2 to add a dummy sample at the beginning
temp = df2.copy()
temp = temp.iloc[0]
temp = pd.DataFrame(temp).T
temp.Name = 'Z'
temp.Notice = pd.to_datetime("05-05-1980")
df2 = pd.concat([temp, df2])

for i in range(len(df2)-1):
    startDate = df2.iloc[i]['Notice']
    endDate = df2.iloc[i+1]['Notice']
    name = df2.iloc[i+1]['Name']
    indices = df1.date.between(startDate, endDate, inclusive="both")
    # a notice date falls into two consecutive intervals, so its count becomes 2
    # and the division below yields the required mean of the two columns
    df3.loc[indices,'Result'] += df1[indices][name]
    df3.loc[indices,'count'] += 1
df3.Result = df3.apply(lambda x: x.Result/x['count'], axis=1)