Python pandas add a years integer column to a date column - python

My question is similar to the one discussed in "How to add a year to a column of dates in pandas";
however, in my case the number of years to add to the date column is stored in another column. This is my non-working code:
import datetime
import pandas as pd
df1 = pd.DataFrame( [ ["Tom",5], ['Jane',3],['Peter',1]], columns = ["Name","Years"])
df1['Date'] = datetime.date.today()
df1['Final_Date'] = df1['Date'] + pd.offsets.DateOffset(years=df1['Years'])
The goal is to add 5 years to the current date for row 1, 3 years to the current date for row 2, and so on.
Any suggestions? Thank you

Convert to time delta by converting years to days, then adding to a converted datetime column:
df1['Final_Date'] = pd.to_datetime(df1['Date']) \
+ pd.to_timedelta(df1['Years'] * 365, unit='D')
Use of to_timedelta with unit='Y' for years is deprecated and throws ValueError.
Edit: multiplying by 365 ignores leap days, so the result can be off by a day or more. If you need calendar-exact results, you will need to go row by row and update the date objects accordingly; the other answers explain how.
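To see how far the 365-day approximation can drift, here is a minimal sketch that compares it with a calendar-exact offset. The fixed date 2021-07-13 and the names approx and exact are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame([["Tom", 5], ["Jane", 3], ["Peter", 1]], columns=["Name", "Years"])
# Fixed date (an assumption, for reproducibility) instead of today's date
df1["Date"] = pd.Timestamp("2021-07-13")

# Approximation: every year is counted as exactly 365 days
approx = df1["Date"] + pd.to_timedelta(df1["Years"] * 365, unit="D")

# Calendar-exact: one DateOffset per row
exact = pd.Series(
    [d + pd.DateOffset(years=int(y)) for d, y in zip(df1["Date"], df1["Years"])],
    index=df1.index,
)

print(pd.DataFrame({"approx": approx, "exact": exact}))
```

The first two rows differ by one day because their spans cross 29 February 2024; the third row is identical in both columns.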

Assuming the number of distinct values in Years is limited, you can group by Years and apply a pd.DateOffset to each group:
df1['new_date'] = (
    df1.groupby('Years', group_keys=False)  # keep the original index so the result aligns with df1
    ['Date'].apply(lambda x: x + pd.DateOffset(years=x.name))
)
print(df1)
Name Years Date new_date
0 Tom 5 2021-07-13 2026-07-13
1 Jane 3 2021-07-13 2024-07-13
2 Peter 1 2021-07-13 2022-07-13
Otherwise, you can extract the year, month and day, add the Years column to the year, and rebuild a datetime column:
df1['Date'] = pd.to_datetime(df1['Date'])
df1['new_date'] = (
    df1.assign(year=lambda x: x['Date'].dt.year + x['Years'],
               month=lambda x: x['Date'].dt.month,
               day=lambda x: x['Date'].dt.day,
               new_date=lambda x: pd.to_datetime(x[['year', 'month', 'day']]))
    ['new_date']
)
This gives the same result.
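One caveat with rebuilding dates from year, month and day is 29 February: adding years can produce a date that does not exist. The sketch below is an illustration on made-up sample dates (the names parts, rebuilt and clamped are assumptions); it shows the rebuilt value becoming NaT with errors="coerce", while a per-row DateOffset clamps to 28 February.

```python
import pandas as pd

# Sample data chosen for illustration: the first date is a leap day
dates = pd.Series(pd.to_datetime(["2020-02-29", "2021-07-13"]))
years = pd.Series([1, 3])

# Rebuild from components; errors="coerce" turns impossible dates into NaT
parts = pd.DataFrame({
    "year": dates.dt.year + years,
    "month": dates.dt.month,
    "day": dates.dt.day,
})
rebuilt = pd.to_datetime(parts, errors="coerce")

# A per-row DateOffset clamps 29 February to 28 February instead
clamped = pd.Series([d + pd.DateOffset(years=int(y)) for d, y in zip(dates, years)])

print(pd.DataFrame({"rebuilt": rebuilt, "clamped": clamped}))
```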

import datetime
import pandas as pd
df1 = pd.DataFrame( [ ["Tom",5], ['Jane',3],['Peter',1]], columns = ["Name","Years"])
df1['Date'] = datetime.date.today()
df1['Final_date'] = datetime.date.today()
df1['Final_date'] = df1.apply(lambda g: g['Date'] + pd.offsets.DateOffset(years = g['Years']), axis=1)
print(df1)
Try this. In your code, pd.offsets.DateOffset(years=df1['Years']) was given the whole column instead of a single value from it.
EDIT: I changed from iterrows to apply(axis=1) because of iterrows's poor performance. Note that apply(axis=1) still works row by row.

Related

Convert 44710.37680 to readable timestamp [duplicate]

This question already has answers here:
Convert Excel style date with pandas
(3 answers)
Closed 7 months ago.
I'm having a hard time converting what is supposed to be a datetime column from an Excel file. When opening it with pandas I get 44710.37680 instead of 5/29/2022 9:02:36. I tried this piece of code to convert it.
df = pd.read_excel(file,'Raw')
df.to_csv(finalfile, index = False)
df = pd.read_csv(finalfile)
df['First LogonTime'] = df['First LogonTime'].apply(lambda x: pd.Timestamp(x).strftime('%Y-%m-%d %H:%M:%S'))
print(df)
And the result I get is 1970-01-01 00:00:00 :c
I don't know if this helps, but it's an .xlsb file that I'm working with.
You can use unit='d' (for days) and subtract 70 years:
pd.to_datetime(44710.37680, unit='d') - pd.DateOffset(years=70)
Result:
Timestamp('2022-05-30 09:02:35.520000')
For dataframes use:
import pandas as pd
df = pd.DataFrame({'First LogonTime':[44710.37680, 44757.00000]})
df['First LogonTime'] = pd.to_datetime(df['First LogonTime'], unit='d') - pd.DateOffset(years=70)
Or:
import pandas as pd
df = pd.DataFrame({'First LogonTime':[44710.37680, 44757.00000]})
df['First LogonTime'] = df['First LogonTime'].apply(lambda x: pd.to_datetime(x, unit='d') - pd.DateOffset(years=70))
Result:
First LogonTime
0 2022-05-30 09:02:35.520
1 2022-07-16 00:00:00.000
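Note that the results above land one day later than the 5/29/2022 expected in the question. A possible alternative, sketched here under the assumption that the workbook uses Excel's default 1900 date system, is to pass the Excel serial-date origin 1899-12-30 to pd.to_datetime directly. The rounding step is an extra added for illustration, to remove floating-point noise in the fractional day.

```python
import pandas as pd

df = pd.DataFrame({"First LogonTime": [44710.37680, 44757.00000]})

# Excel serial dates count days from 1899-12-30 (default 1900 date system)
df["First LogonTime"] = pd.to_datetime(
    df["First LogonTime"], unit="D", origin="1899-12-30"
)

# Round to whole seconds to drop floating-point noise
df["First LogonTime"] = df["First LogonTime"].dt.round("s")

print(df)
```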

Expand values with dates in a pandas dataframe

I have a dataframe with name values and a date range (start/end). I need to expand/replace the dates with the ones generated by the from/to index. How can I do this?
Name date_range
NameOne_%Y%m-%d [-2,1]
NameTwo_%y%m%d [-3,1]
Desired result (Assuming that today's date is 2021-03-09 - 9 of march 2021):
Name
NameOne_202103-10
NameOne_202103-09
NameOne_202103-08
NameOne_202103-07
NameTwo_210310
NameTwo_210309
NameTwo_210308
NameTwo_210307
NameTwo_210306
I've tried iterating over the dataframe and generating the dates, but I still can't make it work:
for index, row in self.config_df.iterrows():
    print(row['source'], row['date_range'])
    days_sub=int(str(self.config_df["date_range"][0]).strip("[").strip("]").split(",")[0].strip())
    days_add=int(str(self.config_df["date_range"][0]).strip("[").strip("]").split(",")[1].strip())
    start_date = date.today() + timedelta(days=days_sub)
    end_date = date.today() + timedelta(days=days_add)
    date_range_df=pd.date_range(start=start_date, end=end_date)
    date_range_df["source"]=row['source']
Any help is appreciated. Thanks!
Convert your date_range from str to list with ast module:
import ast
df = df.assign(date_range=df["date_range"].apply(ast.literal_eval))
Use date_range to create list of dates and explode to chain the list:
today = pd.Timestamp.today().normalize()
offset = pd.tseries.offsets.Day # shortcut
names = pd.Series([pd.date_range(today + offset(end),
today + offset(start),
freq="-1D").strftime(name)
for name, (start, end) in df.values]).explode(ignore_index=True)
>>> names
0 NameOne_202103-10
1 NameOne_202103-09
2 NameOne_202103-08
3 NameOne_202103-07
4 NameTwo_210310
5 NameTwo_210309
6 NameTwo_210308
7 NameTwo_210307
8 NameTwo_210306
dtype: object
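For a reproducible run, the same idea can be sketched end to end with a fixed "today". The fixed date 2021-03-09 comes from the question; the helper name expand_name and the use of pd.Timedelta are assumptions made for this illustration.

```python
import ast
import pandas as pd

df = pd.DataFrame({
    "Name": ["NameOne_%Y%m-%d", "NameTwo_%y%m%d"],
    "date_range": ["[-2,1]", "[-3,1]"],
})
# Parse the "[start,end]" strings into real lists of ints
df["date_range"] = df["date_range"].apply(ast.literal_eval)

# Fixed date from the question, so the output does not depend on the run date
today = pd.Timestamp("2021-03-09")

def expand_name(pattern, start, end):
    """Hypothetical helper: format every day in [today+start, today+end], newest first."""
    days = pd.date_range(today + pd.Timedelta(days=start),
                         today + pd.Timedelta(days=end), freq="D")
    return list(days[::-1].strftime(pattern))

names = pd.Series(
    [expand_name(name, start, end) for name, (start, end) in zip(df["Name"], df["date_range"])]
).explode(ignore_index=True)

print(names)
```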
Alright. From your question I understand you have a starting data frame like so:
config_df = pd.DataFrame({
'name': ['NameOne_%Y-%m-%d', 'NameTwo_%y%m%d'],
'str_date_range': ['[-2,1]', '[-3,1]']})
Resulting in this:
name str_date_range
0 NameOne_%Y-%m-%d [-2,1]
1 NameTwo_%y%m%d [-3,1]
To achieve your goal and avoid iterating rows - which should be avoided using pandas - you can use groupby().apply() like so:
def expand(row):
    # Get the start and end offsets from the row, by splitting
    # the string and taking the first and last value respectively.
    # .min() collapses the one-element Series to a scalar, because
    # row is really the group's DataFrame
    start_date = row.str_date_range.str.strip('[]').str.split(',').str[0].astype(int).min()
    end_date = row.str_date_range.str.strip('[]').str.split(',').str[1].astype(int).min()
    # Create a range from start_date to end_date.
    # Note that range() does not include the end value, therefore add 1
    day_range = range(start_date, end_date+1)
    # Create a Timedelta series from the day_range
    days_diff = pd.to_timedelta(pd.Series(day_range), unit='days')
    # Create an equally sized Series of today Timestamps
    # (repeating len(day_range)-1 times would drop the last day)
    todays = pd.Series(pd.Timestamp.today()).repeat(len(day_range)).reset_index(drop=True)
    df = todays.to_frame(name='date')
    # Add days_diff to date column
    df['date'] = df.date + days_diff
    df['name'] = row.name
    # Extract the date format from the name
    date_format = row.name.split('_')[1]
    # Add a column with the formatted date using the date_format string
    df['date_str'] = df.date.dt.strftime(date_format=date_format)
    df['name'] = df.name.str.split('_').str[0] + '_' + df.date_str
    # Optional: drop columns
    return df.drop(columns=['date'])
config_df.groupby('name').apply(expand).reset_index(drop=True)
returning:
                 name    date_str
0  NameOne_2021-03-07  2021-03-07
1  NameOne_2021-03-08  2021-03-08
2  NameOne_2021-03-09  2021-03-09
3  NameOne_2021-03-10  2021-03-10
4      NameTwo_210306      210306
5      NameTwo_210307      210307
6      NameTwo_210308      210308
7      NameTwo_210309      210309
8      NameTwo_210310      210310

Select nearest date first day of month in a python dataframe

I have this kind of dataframe
These data represent the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following one), but sometimes more often. The value can be reset to "0" if the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, but this entry has to be the nearest to the first day of the month AND before the 15th day of the month (because if the day is later it could be the measurement for the end of the month). Another condition is that if the difference between two values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data needs to be
The purpose is to calculate a single consumption value per month.
One solution is to iterate over the dataframe (as an array) and apply some if conditions. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can normalize the dates to the month end with MonthEnd, sort by the distance to it, and then drop duplicates on that column, keeping the first row per month.
from pandas.tseries.offsets import MonthEnd
# Assumes the dates are in a datetime column named 'Index'
df['New'] = df['Index'] + MonthEnd(1)
df['Diff'] = (df['Index'] - df['New']).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow if this isn't doing the job.
Define the dataframe, convert the index to datetime, define helper columns,
use them with the shift method to conditionally remove rows, and finally remove the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([
[1254],
[1265],
[1277],
[1301],
[1345],
[1541]
], columns=["Value"]
, index=[dt.strptime("05-10-19", '%d-%m-%y'),
dt.strptime("29-10-19", '%d-%m-%y'),
dt.strptime("30-10-19", '%d-%m-%y'),
dt.strptime("04-11-19", '%d-%m-%y'),
dt.strptime("30-11-19", '%d-%m-%y'),
dt.strptime("03-02-20", '%d-%m-%y')
]
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
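As a simpler starting point, the first rule from the question (one entry per month, nearest to the first day and before the 15th) can be sketched with a filter and a groupby. This is only an illustration on the sample values from the answer above; it does not handle the counter-reset condition, and the names early and monthly are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"Value": [1254, 1265, 1277, 1301, 1345, 1541]},
    index=pd.to_datetime([
        "2019-10-05", "2019-10-29", "2019-10-30",
        "2019-11-04", "2019-11-30", "2020-02-03",
    ]),
).sort_index()

# Keep only readings taken before the 15th of the month
early = df[df.index.day < 15]

# With the index sorted, the first row of each month is the one
# closest to the first day of that month
monthly = early.groupby(early.index.to_period("M")).head(1)

print(monthly)
```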

Convert Stacked DataFrame of Years and Months to DataFrame with Datetime Indices

I am reading a csv file of the number of employees in the US by year and month (in thousands). It starts out like this:
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1961,45119,44969,45051,44997,45119,45289,45400,45535,45591,45716,45931,46035
1962,46040,46309,46375,46679,46668,46644,46720,46775,46888,46927,46910,46901
1963,46912,47000,47077,47316,47328,47356,47461,47542,47661,47805,47771,47863
...
I want my Pandas Dataframe to have the datetime as the index for each month's value. I'm doing this so I can later add values for specific time ranges. I want it to look something like this:
1961-01-01 45119.0
1961-02-01 44969.0
1961-03-01 45051.0
1961-04-01 44997.0
1961-05-01 45119.0
...
I did some research and thought that if I stacked the years and months together, I could combine them into a datetime. Here is what I have done:
import pandas as pd
import numpy as np
df = pd.read_csv("BLS_private.csv", header=5, index_col="Year")
df.columns = range(1, 13) # I transformed months into numbers 1-12 for easier datetime conversion
df = df.stack() # Months are no longer columns
print(df)
Here is my output:
Year
1961 1 45119.0
2 44969.0
3 45051.0
4 44997.0
5 45119.0
...
I do not know how to combine the year and the months in the stacked indices. Does stacking the indices help at all in my case? I am also not the most familiar with Pandas datetime, so any explanation about how I could use that would be very helpful. Also if anyone has alternate solutions than making datetime the index, I welcome ideas.
After the stack, create the DatetimeIndex from the current (Year, month) MultiIndex:
from datetime import datetime
dt_index = pd.to_datetime([datetime(year=year, month=month, day=1)
for year, month in df.index.values])
df.index = dt_index
df.head(3)
# 1961-01-01 45119
# 1961-02-01 44969
# 1961-03-01 45051
import pandas as pd
df = pd.read_csv("BLS_private.csv", index_col="Year")
dates = pd.date_range(start=str(df.index[0]), end=str(df.index[-1] + 1), inclusive='left', freq="MS")  # pandas < 1.4 used closed='left'
df = df.stack()
df.index = dates
df.to_frame()
from io import StringIO
import pandas as pd
s = """Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1961,45119,44969,45051,44997,45119,45289,45400,45535,45591,45716,45931,46035
1962,46040,46309,46375,46679,46668,46644,46720,46775,46888,46927,46910,46901
1963,46912,47000,47077,47316,47328,47356,47461,47542,47661,47805,47771,47863"""
df = pd.read_csv(StringIO(s))
# set index and stack
stack = df.set_index('Year').stack().reset_index()
# create a new index
stack.index = pd.to_datetime(stack['Year'].astype(str) + '-' + stack['level_1'], format='%Y-%b')
# remove columns
final = stack[0].to_frame()
1961-01-01 45119
1961-02-01 44969
1961-03-01 45051
1961-04-01 44997
1961-05-01 45119
1961-06-01 45289
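Another way to combine the stacked year and month levels, sketched here on the first two rows of the sample data, is to hand pd.to_datetime a frame with year, month and day columns. The names wide and long are assumptions made for this illustration.

```python
from io import StringIO
import pandas as pd

csv_text = """Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1961,45119,44969,45051,44997,45119,45289,45400,45535,45591,45716,45931,46035
1962,46040,46309,46375,46679,46668,46644,46720,46775,46888,46927,46910,46901"""

wide = pd.read_csv(StringIO(csv_text), index_col="Year")
wide.columns = range(1, 13)  # month names -> month numbers 1-12
long = wide.stack()          # index is now (Year, month)

# Pull both index levels out and assemble them into first-of-month dates
parts = pd.DataFrame({
    "year": long.index.get_level_values(0),
    "month": long.index.get_level_values(1),
    "day": 1,
})
long.index = pd.DatetimeIndex(pd.to_datetime(parts))

print(long.head())
```

With a DatetimeIndex in place, a time range can be selected with a slice such as long["1961-06":"1962-03"].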

How to convert date format QQ-YYYY to a datetime object [duplicate]

This question already has answers here:
Convert Pandas Column to DateTime
(8 answers)
Closed 4 years ago.
I have a pandas dataframe with a column that should indicate the end of a financial quarter. The format is of the type "Q1-2009". Is there a quick way to convert these strings into a timestamp as "2009-03-31"?
I have found only the conversion from the format "YYYY-QQ", but not the opposite.
Create quarterly periods by swapping the quarter and year parts with replace, then convert to datetimes with PeriodIndex.to_timestamp:
df = pd.DataFrame({'per':['Q1-2009','Q3-2007']})
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['date'] = (pd.PeriodIndex(df['per'].str.replace(r'(Q\d)-(\d+)', r'\2-\1', regex=True), freq='Q')
              .to_timestamp(how='e'))
print (df)
per date
0 Q1-2009 2009-03-31
1 Q3-2007 2007-09-30
Another solution is to use string indexing:
df['date'] = (pd.PeriodIndex(df['per'].str[-4:] + df['per'].str[:2], freq='Q')
.to_timestamp(how='e'))
One solution using a list comprehension followed by pd.offsets.MonthEnd:
# data from #jezrael
df = pd.DataFrame({'per':['Q1-2009','Q3-2007']})
def get_values(x):
''' Returns string with quarter number multiplied by 3 '''
return f'{int(x[0][1:])*3}-{x[1]}'
values = [get_values(x.split('-')) for x in df['per']]
df['LastDay'] = pd.to_datetime(values, format='%m-%Y') + pd.offsets.MonthEnd(1)
print(df)
per LastDay
0 Q1-2009 2009-03-31
1 Q3-2007 2007-09-30
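In recent pandas versions, to_timestamp(how='e') returns the last instant of the period (23:59:59.999999999) rather than midnight. A minimal sketch of the string-indexing variant, assuming you want a plain date at midnight, adds normalize(); the variable name quarters is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"per": ["Q1-2009", "Q3-2007"]})

# "Q1-2009" -> "2009Q1", which PeriodIndex can parse as a quarter
quarters = pd.PeriodIndex(df["per"].str[-4:] + df["per"].str[:2], freq="Q")

# End of each quarter, truncated to midnight of the last day
df["date"] = quarters.to_timestamp(how="end").normalize()

print(df)
```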
