I have a DataFrame like the one below:
df = pd.DataFrame({"ID" : ["1", "2", "3"],
"Date" : ["12/11/2020", "12/10/2020", "05/04/2020"]})
I need to calculate the number of months from each value in the Date column until today.
You can adapt this solution, subtracting from a scalar d:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
d = pd.to_datetime('now')
df['Amount'] = 12 * (d.year - df['Date'].dt.year) + d.month - df['Date'].dt.month
print (df)
ID Date Amount
0 1 2020-11-12 1
1 2 2020-10-12 2
2 3 2020-04-05 8
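The year/month arithmetic above counts calendar-month boundaries crossed, not completed months. Here is a minimal sketch of both variants, assuming a fixed reference date (`ref`, a hypothetical stand-in for "today") so the result is reproducible:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["1", "2", "3"],
                   "Date": ["12/11/2020", "12/10/2020", "05/04/2020"]})
# Explicit day-first format avoids any ambiguity between day and month
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Assumption: a fixed reference date instead of the current time
ref = pd.Timestamp("2020-12-10")

# Calendar-month boundaries crossed between each date and ref
calendar_months = 12 * (ref.year - df["Date"].dt.year) + (ref.month - df["Date"].dt.month)

# Completed months: subtract one where the day of month has not been reached yet
completed_months = calendar_months - (df["Date"].dt.day > ref.day).astype(int)

df["Calendar"] = calendar_months
df["Completed"] = completed_months
print(df)
```

Which of the two you want depends on whether a partial month should count.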
Try this code, which subtracts the 'Date' column from the current time. I also use np.ceil, which rounds a number up:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# Recent pandas versions reject np.timedelta64(1, 'M'), so divide by an average month length instead
df['Amount'] = ((pd.to_datetime('now') - df['Date']) / pd.Timedelta(days=365.2425 / 12)).apply(np.ceil)
print(df)
from datetime import datetime
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID" : ["1", "2", "3"],
"Date" : ["12/11/2020", "12/10/2020", "05/04/2020"]})
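# infer_datetime_format is deprecated in recent pandas and np.timedelta64(1, 'M') is rejected,
# so parse with dayfirst only and divide by an average month length instead
df['Month_diff'] = round(((datetime.now() - pd.to_datetime(df.Date, dayfirst=True)) / pd.Timedelta(days=365.2425 / 12)) - 0.5)
This is a one-liner that converts the Date column to datetime and then performs the calculation. Output: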
ID Date Month_diff
0 1 12/11/2020 1.0
1 2 12/10/2020 2.0
2 3 05/04/2020 8.0
import datetime
import yfinance as yf
import numpy as np
import pandas as pd
ETF_DB = ['QQQ', 'EGFIX']
fundsret = yf.download(ETF_DB, start=datetime.date(2020,12,31), end=datetime.date(2022,4,30), interval='1mo')['Adj Close'].pct_change()
df = pd.DataFrame(fundsret)
df
Gives me:
I'm trying to remove the rows in the dataframe that aren't month end such as the row 2021-03-22. How do I have the dataframe go through and remove the rows where the date doesn't end in '01'?
df.reset_index(inplace=True)
# Convert the date to datetime64
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
# Select only the rows where the day is 1
filtered = df.loc[df['Date'].dt.day == 1]
Did you mean month start?
You can use:
df = df[df.index.day==1]
reproducible example:
df = pd.DataFrame(columns=['A', 'B'],
index=['2021-01-01', '2021-02-01', '2021-03-01',
'2021-03-22', '2021-03-31'])
df.index = pd.to_datetime(df.index, dayfirst=False)
output:
A B
2021-01-01 NaN NaN
2021-02-01 NaN NaN
2021-03-01 NaN NaN
End of month
For the end of the month, you can add 1 day and check whether this jumps to the next month:
end = (df.index+pd.Timedelta('1d')).month != df.index.month
df = df[end]
or add an offset and check if the value is unchanged:
end = df.index == (df.index + pd.offsets.MonthEnd(0))
df = df[end]
output:
A B
2021-03-31 NaN NaN
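As an alternative sketch, a DatetimeIndex also exposes the boolean attributes is_month_start and is_month_end, which express both filters directly. The column name and values here are made up for illustration:

```python
import pandas as pd

idx = pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01",
                      "2021-03-22", "2021-03-31"])
df = pd.DataFrame({"A": range(5)}, index=idx)

# Keep rows dated on the first day of a month
month_start = df[df.index.is_month_start]

# Keep rows dated on the last day of a month
month_end = df[df.index.is_month_end]

print(month_start)
print(month_end)
```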
import pandas as pd
# Dummy data (named `data` to avoid shadowing the built-in dict)
data = {
    'Date': ['2021-01-01', '2022-03-01', '2023-04-22', '2023-04-01'],
    'Name': ['A', 'B', 'C', 'D']
}
# Making a DataFrame
df = pd.DataFrame(data)
# Required date pattern: YYYY-MM-01 (raw string so \d is not treated as an escape)
pattern = r'\d{4}-\d{2}-01'
# Keep only the rows whose Date string matches the pattern
new_df = df[df['Date'].str.match(pattern)]
print(new_df)
My data frame contains an IGN_DATE column whose values look like 20080727142700; the format is YYYYMMDDHHMMSS.
The column type is float64.
How can I get separate columns for the time, the date (without 00:00:00), the day, and the month?
What I tried:
Column name IGN_DATE
dataframe - df
df['IGN_DATE'] = df['IGN_DATE'].apply(str)
df['DATE'] = pd.to_datetime(df['IGN_DATE'].str.slice(start = 0, stop = 8))
df['MONTH'] = df['IGN_DATE'].str.slice(start = 4, stop = 6).astype(int)
df['DAY'] = df['IGN_DATE'].str.slice(start = 6, stop = 8).astype(int)
df['TIME'] = df['IGN_DATE'].str.slice(start = 8, stop = 13)
DATE is in the format YYYY-MM-DD 00:00:00. I don't want 00:00:00 in DATE.
How do I get the time, which is a string, into HH:MM:SS format?
Is there a simpler way to do this?
If NaN values are not important, you can drop them with dropna, convert with to_datetime using an explicit format, and then use the dt accessor to extract the desired values:
# Drop Rows with nan in IGN_DATE column
df = df.dropna(subset=['IGN_DATE'])
# Convert dtype to whole number then to `str`
df['IGN_DATE'] = df['IGN_DATE'].astype('int64').astype(str)
# Series of datetime values from Column
s = pd.to_datetime(df['IGN_DATE'], format='%Y%m%d%H%M%S')
# Extract out and add to DataFrame from `s`
df['DATE'] = s.dt.date
df['MONTH'] = s.dt.month
df['DAY'] = s.dt.day
df['TIME'] = s.dt.time
Otherwise, you can mask the non-null values in IGN_DATE and assign only to those rows:
# Mask not null values
m = df['IGN_DATE'].notna()
# Convert to String
df.loc[m, 'IGN_DATE'] = df.loc[m, 'IGN_DATE'].astype('int64').astype(str)
# Series of datetime values from Column
s = pd.to_datetime(df['IGN_DATE'], format='%Y%m%d%H%M%S')
# Extract out and add to DataFrame from `s`
df.loc[m, 'DATE'] = s.dt.date
df.loc[m, 'MONTH'] = s.dt.month
df.loc[m, 'DAY'] = s.dt.day
df.loc[m, 'TIME'] = s.dt.time
Sample DF:
import numpy as np
import pandas as pd
df = pd.DataFrame({'IGN_DATE': [20080727142700, np.nan, 20151015171807]})
Sample Output with dropna:
IGN_DATE DATE MONTH DAY TIME
0 20080727142700 2008-07-27 7 27 14:27:00
2 20151015171807 2015-10-15 10 15 17:18:07
Sample Output with mask:
IGN_DATE DATE MONTH DAY TIME
0 20080727142700 2008-07-27 7.0 27.0 14:27:00
1 NaN NaN NaN NaN NaN
2 20151015171807 2015-10-15 10.0 15.0 17:18:07
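If TIME is wanted as an HH:MM:SS string rather than a datetime.time object, dt.strftime can produce it. This is a minimal sketch on the sample frame that leaves IGN_DATE untouched and keeps missing rows by reindexing; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"IGN_DATE": [20080727142700, np.nan, 20151015171807]})

# Parse only the non-null rows, then realign to the full index (missing rows become NaT)
m = df["IGN_DATE"].notna()
text = df.loc[m, "IGN_DATE"].astype("int64").astype(str)
s = pd.to_datetime(text, format="%Y%m%d%H%M%S").reindex(df.index)

df["DATE"] = s.dt.date                  # datetime.date objects, no 00:00:00 part
df["MONTH"] = s.dt.month                # float because of the missing row
df["DAY"] = s.dt.day
df["TIME"] = s.dt.strftime("%H:%M:%S")  # plain strings such as '14:27:00'
print(df)
```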
I have two pandas dataframes. I would like to keep all rows in df2 where Type is equal to Type in df1 AND Date is between Date in df1 (- 1 day or + 1 day). How can I do this?
df1
IBSN Type Date
0 1 X 2014-08-17
1 1 Y 2019-09-22
df2
IBSN Type Date
0 2 X 2014-08-16
1 2 D 2019-09-22
2 9 X 2014-08-18
3 3 H 2019-09-22
4 3 Y 2019-09-23
5 5 G 2019-09-22
res
IBSN Type Date
0 2 X 2014-08-16 <-- keep because Type = df1[0]['Type'] AND Date = df1[0]['Date'] - 1
1 9 X 2014-08-18 <-- keep because Type = df1[0]['Type'] AND Date = df1[0]['Date'] + 1
2 3 Y 2019-09-23 <-- keep because Type = df1[1]['Type'] AND Date = df1[1]['Date'] + 1
This should do it:
import pandas as pd
from datetime import timedelta
# create dummy data
df1 = pd.DataFrame([[1, 'X', '2014-08-17'], [1, 'Y', '2019-09-22']], columns=['IBSN', 'Type', 'Date'])
df1['Date'] = pd.to_datetime(df1['Date']) # might not be necessary if your Date column already contains datetime objects
df2 = pd.DataFrame([[2, 'X', '2014-08-16'], [2, 'D', '2019-09-22'], [9, 'X', '2014-08-18'], [3, 'H', '2019-09-22'], [3, 'Y', '2014-09-23'], [5, 'G', '2019-09-22']], columns=['IBSN', 'Type', 'Date'])
df2['Date'] = pd.to_datetime(df2['Date']) # might not be necessary if your Date column already contains datetime objects
# add date boundaries to the first dataframe
df1['Date_from'] = df1['Date'] - timedelta(days=1)
df1['Date_to'] = df1['Date'] + timedelta(days=1)
# merge the date boundaries into df2 on 'Type', keep rows where Date is between
# Date_from and Date_to (inclusive), then drop the 'Date_from' and 'Date_to' columns
df2 = df2.merge(df1.loc[:, ['Type', 'Date_from', 'Date_to']], on='Type', how='left')
res = df2[(df2['Date'] >= df2['Date_from']) & (df2['Date'] <= df2['Date_to'])].drop(['Date_from', 'Date_to'], axis=1)
Note that, by your logic, row 4 in df2 as defined here (3 Y 2014-09-23) should not be kept, because its date (2014) does not fall within the range given by df1 (year 2019).
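For comparison, here is a compact sketch of the same merge-then-filter idea using the data exactly as posted in the question (where the Y row in df2 is 2019-09-23). The helper column name `Ref` is an assumption made for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"IBSN": [1, 1], "Type": ["X", "Y"],
                    "Date": pd.to_datetime(["2014-08-17", "2019-09-22"])})
df2 = pd.DataFrame({"IBSN": [2, 2, 9, 3, 3, 5],
                    "Type": ["X", "D", "X", "H", "Y", "G"],
                    "Date": pd.to_datetime(["2014-08-16", "2019-09-22", "2014-08-18",
                                            "2019-09-22", "2019-09-23", "2019-09-22"])})

# Pair every df2 row with the df1 reference dates of the same Type
bounds = df1[["Type", "Date"]].rename(columns={"Date": "Ref"})
merged = df2.reset_index().merge(bounds, on="Type", how="inner")

# Keep pairs whose dates are at most one day apart
within = (merged["Date"] - merged["Ref"]).abs() <= pd.Timedelta(days=1)

# Select the surviving rows from the original df2 by their original index labels
res = df2.loc[sorted(merged.loc[within, "index"].unique())]
print(res)
```

Going back to df2 by index label avoids duplicated rows when a Type appears more than once in df1.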
Assuming the Date columns in both DataFrames already have a datetime dtype, I would build an IntervalIndex and set it as the index of df1, map df1's Type column onto the dates in df2, and finally compare for equality to create a mask for slicing:
iix = pd.IntervalIndex.from_arrays(df1.Date + pd.Timedelta(days=-1),
df1.Date + pd.Timedelta(days=1), closed='both')
df1 = df1.set_index(iix)
s = df2['Date'].map(df1.Type)
df_final = df2[df2.Type == s]
Out[1131]:
IBSN Type Date
0 2 X 2014-08-16
2 9 X 2014-08-18
4 3 Y 2019-09-23
I have a multi-index DataFrame that looks like this:
In[13]: df
Out[13]:
Last Trade
Date Ticker
1983-03-30 CLM83 1983-05-18
CLN83 1983-06-17
CLQ83 1983-07-18
CLU83 1983-08-19
CLV83 1983-09-16
CLX83 1983-10-18
CLZ83 1983-11-18
1983-04-04 CLM83 1983-05-18
CLN83 1983-06-17
CLQ83 1983-07-18
CLU83 1983-08-19
CLV83 1983-09-16
CLX83 1983-10-18
CLZ83 1983-11-18
The index has two levels ('Date' and 'Ticker'). I would like to apply a function to the 'Last Trade' column that tells me how many months separate each 'Last Trade' date from the corresponding 'Date' index value.
I found a function that does the calculation:
import datetime
from calendar import monthrange
def monthdelta(d1, d2):
    # Count how many whole months fit between d1 and d2
    delta = 0
    while True:
        mdays = monthrange(d1.year, d1.month)[1]
        d1 += datetime.timedelta(days=mdays)
        if d1 <= d2:
            delta += 1
        else:
            break
    return delta
I tried to apply the following function h, but it raises AttributeError: 'Timestamp' object has no attribute 'index':
In[14]: h = lambda x: monthdelta(x.index.get_level_values(0),x)
In[15]: df['Last Trade'] = df['Last Trade'].apply(h)
How can I apply a function that would use both a column and an index value?
Thank you for your tips,
Use df.index.to_series().str.get(0) to get the first level of the index.
(df['Last Trade'].dt.month - df.index.to_series().str.get(0).dt.month) + \
(df['Last Trade'].dt.year - df.index.to_series().str.get(0).dt.year) * 12
Date Ticker
1983-03-30 CLM83 2
CLN83 3
CLQ83 4
CLU83 5
CLV83 6
CLX83 7
CLZ83 8
1983-04-04 CLM83 1
CLN83 2
CLQ83 3
CLU83 4
CLV83 5
CLX83 6
CLZ83 7
dtype: int64
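Another way to reach the first index level is Index.get_level_values, which returns a DatetimeIndex directly. This is a minimal sketch on a cut-down version of the frame (two tickers per date), not the answerer's original code:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["1983-03-30", "1983-04-04"]), ["CLM83", "CLN83"]],
    names=["Date", "Ticker"],
)
df = pd.DataFrame(
    {"Last Trade": pd.to_datetime(["1983-05-18", "1983-06-17"] * 2)},
    index=idx,
)

# The 'Date' level as a DatetimeIndex, in the same row order as df
dates = df.index.get_level_values("Date")
last = df["Last Trade"]

# Calendar-month difference; to_numpy() makes the subtraction positional
months = (last.dt.year - dates.year.to_numpy()) * 12 + (last.dt.month - dates.month.to_numpy())
print(months)
```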
Timing
Given df
pd.concat([df for _ in range(10000)])
Try this instead of your function:
Option 1
You get an integer number
def monthdelta(row):
    trade = row['Last Trade'].year*12 + row['Last Trade'].month
    date = row['Date'].year*12 + row['Date'].month
    return trade - date
df.reset_index().apply(monthdelta, axis=1)
Inspired by PiRsquared:
df = df.reset_index()
(df['Last Trade'].dt.year*12 + df['Last Trade'].dt.month) -\
(df['Date'].dt.year*12 + df['Date'].dt.month)
Option 2
You get a numpy.timedelta64
This can be used directly in other date computations. However, it is expressed in days, not months, because the number of days in a month is not constant.
def monthdelta(row):
    return row['Last Trade'] - row['Date']
df.reset_index().apply(monthdelta, axis=1)
Inspired by PiRsquared:
df = df.reset_index()
df['Last Trade'] - df['Date']
Option 2 will of course be faster, because it involves fewer computations. Pick whichever you like!
To get your index back: df = df.set_index(['Date', 'Ticker'])
Is there a better way than bdate_range() to measure business days between two columns of dates via pandas?
df = pd.DataFrame({ 'A' : ['1/1/2013', '2/2/2013', '3/3/2013'],
'B': ['1/12/2013', '4/4/2013', '3/3/2013']})
print(df)
df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
f = lambda x: len(pd.bdate_range(x['A'], x['B']))
df['DIFF'] = df.apply(f, axis=1)
print(df)
With output of:
A B
0 1/1/2013 1/12/2013
1 2/2/2013 4/4/2013
2 3/3/2013 3/3/2013
A B DIFF
0 2013-01-01 00:00:00 2013-01-12 00:00:00 9
1 2013-02-02 00:00:00 2013-04-04 00:00:00 44
2 2013-03-03 00:00:00 2013-03-03 00:00:00 0
Thanks!
brian_the_bungler was onto the most efficient way of doing this using numpy's busday_count:
import numpy as np
A = [d.date() for d in df['A']]
B = [d.date() for d in df['B']]
df['DIFF'] = np.busday_count(A, B)
print(df)
On my machine this is 300x faster on your test case, and 1000s of times faster on much larger arrays of dates
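One caveat worth sketching: np.busday_count excludes the end date, while len(pd.bdate_range(...)) includes it, so the two can differ by one when the end date is a business day. The example below uses the question's data; the column names DIFF_exclusive and DIFF_inclusive are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["1/1/2013", "2/2/2013", "3/3/2013"],
                   "B": ["1/12/2013", "4/4/2013", "3/3/2013"]})
df["A"] = pd.to_datetime(df["A"], format="%m/%d/%Y")
df["B"] = pd.to_datetime(df["B"], format="%m/%d/%Y")

# busday_count works on day-resolution datetime64 arrays
start = df["A"].to_numpy().astype("datetime64[D]")
end = df["B"].to_numpy().astype("datetime64[D]")

# Half-open interval [A, B): the end date itself is not counted
df["DIFF_exclusive"] = np.busday_count(start, end)

# Shift the end by one day to count B as well, matching len(pd.bdate_range(A, B))
df["DIFF_inclusive"] = np.busday_count(start, end + np.timedelta64(1, "D"))
print(df)
```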
You can use pandas' BDay offset to step a given number of business days from a date, like this:
new_column = some_date - pd.tseries.offsets.BDay(15)
Read more in this conversation: https://stackoverflow.com/a/44288696
It also works if some_date is a single date value, not a series.