Change year based on start and end date in dataframe - python

I had a column in data frame called startEndDate, example: '10.12-20.05.2019', divided those to columns start_date and end_date with same year, example: start_date '10.12.2019' and end_date '20.05.2019'. But year in this example is wrong, as it should be 2018 because start date cannot be after end date. How can I compare entire dataframe and replace values so it contains correct start_dates based on if statement(because some start dates should stay with year as 2019)?

This will show you which rows the start_date is > than the end date
data = {
'Start_Date' : ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
'End_Date' : ['2020-02-01', '2019-01-02', '2019-01-03', '2020-01-05']
}
df = pd.DataFrame(data)
df['Start_Date'] = pd.to_datetime(df['Start_Date'], infer_datetime_format=True)
df['End_Date'] = pd.to_datetime(df['End_Date'], infer_datetime_format=True)
df['Check'] = np.where(df['Start_Date'] > df['End_Date'], 'Error', 'No Error')
df
Without seeing more of your data or your intended final data this is the best we will be able to do to help identify problems in the data.

This method first splits up the date string to two dates and creates start and end date columns. Then it subtracts 1 year from the start date if it is greater than the end date.
import pandas as pd
import numpy as np
# mock data
df = pd.DataFrame({"dates": ["10.12-20.05.2019", "02.04-31.10.2019"]})
# split date string to two dates, convert to datetime and stack to columns
df[["start", "end"]] = np.vstack(
df.dates.apply(lambda x: pd.to_datetime(
[x.split("-")[0] + x[-5:],
x.split("-")[1]], format="%d.%m.%Y")))
# subtract 1 year from start date if greater than end date
df["start"] = np.where(df["start"]>df["end"],
df["start"] - pd.DateOffset(years=1),
df["start"])
df
# dates start end
#0 10.12-20.05.2019 2018-12-10 2019-05-20
#1 02.04-31.10.2019 2019-04-02 2019-10-31
Although I have used split here for the initial splitting of the string, as there will always be 5 characters before the hyphen, and the date will always be the last 5 (with the .), there is no need to use the split and instead that line could change to:
df[["start", "end"]] = np.vstack(
df.dates.apply(lambda x: pd.to_datetime(
[x[:5] + x[-5:],
x[6:]], format="%d.%m.%Y")))

Related

function to create new column with defined parameters

the problem is i can't think of a way to get the 'mark' from the last day of the previous month.
because I need to compare the current month with the previous month
It's a number generated every day, Mark_LastDayData takes as reference the mark of the last day of the month and replaces it in all values ​​of that same month. 'Mark_LastDayDate_PreviousMonth' 4 it would be like getting the 'Mark_LastDayData' from the previous month so I can make a comparison in the future
I have the following DF
import pandas as pd
from pandas.tseries.offsets import BMonthEnd
import datetime as dt
df = pd.DataFrame({'Found':['A','A','A','A','A','B','B','B'],
'Date':['14/10/2021','19/10/2021','29/10/2021','30/09/2021','20/09/2021','20/10/2021','29/10/2021','15/10/2021'],
#'LastDayMonth':['29/10/2021','29/10/2021','29/10/2021','30/09/2021','30/09/2021','29/10/2021','29/10/2021','29/10/2021'],
'Mark':[1,2,3,4,3,1,2,3]
})
print(df)
**LastDayMonth was obtained through the code
I made some changes to the date
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(df['Date'], format = '%Y/%m/%d')
df['LastDayDate'] = pd.to_datetime(df['Date']) + BMonthEnd(0)
df['LastDayDatePrevMonth'] = pd.to_datetime(df['Date']) - pd.DateOffset(months=1)
I needed the 'Mark' of the last day of the month of each date so I used the method
df = df.merge(df.loc[df['Date'] == df['LastDayDate'], ['Found','LastDayDate','Mark']],
on=['Found', 'LastDayDate'],
how='left', suffixes=['', '_LastDayDate'])
How can I do this to get the 'mark' from the last day of the previous month
in the same column
Sample df that I filled in manually

Expand values with dates in a pandas dataframe

I have a dataframe with name values and a date range (start/end). I need to expand/replace the dates with the ones generated by the from/to index. How can I do this?
Name date_range
NameOne_%Y%m-%d [-2,1]
NameTwo_%y%m%d [-3,1]
Desired result (Assuming that today's date is 2021-03-09 - 9 of march 2021):
Name
NameOne_202103-10
NameOne_202103-09
NameOne_202103-08
NameOne_202103-07
NameTwo_210310
NameTwo_210309
NameTwo_210308
NameTwo_210307
NameTwo_210306
I've been trying iterating over the dataframe and then generating the dates, but I still can't make it work..
for index, row in self.config_df.iterrows():
print(row['source'], row['date_range'])
days_sub=int(str(self.config_df["date_range"][0]).strip("[").strip("]").split(",")[0].strip())
days_add=int(str(self.config_df["date_range"][0]).strip("[").strip("]").split(",")[1].strip())
start_date = date.today() + timedelta(days=days_sub)
end_date = date.today() + timedelta(days=days_add)
date_range_df=pd.date_range(start=start_date, end=end_date)
date_range_df["source"]=row['source']
Any help is appreciated. Thanks!
Convert your date_range from str to list with ast module:
import ast
df = df.assign(date_range=df["date_range"].apply(ast.literal_eval)
Use date_range to create list of dates and explode to chain the list:
today = pd.Timestamp.today().normalize()
offset = pd.tseries.offsets.Day # shortcut
names = pd.Series([pd.date_range(today + offset(end),
today + offset(start),
freq="-1D").strftime(name)
for name, (start, end) in df.values]).explode(ignore_index=True)
>>> names
0 NameOne_202103-10
1 NameOne_202103-09
2 NameOne_202103-08
3 NameOne_202103-07
4 NameTwo_210310
5 NameTwo_210309
6 NameTwo_210308
7 NameTwo_210307
8 NameTwo_210306
dtype: object
Alright. From your question I understand you have a starting data frame like so:
config_df = pd.DataFrame({
'name': ['NameOne_%Y-%m-%d', 'NameTwo_%y%m%d'],
'str_date_range': ['[-2,1]', '[-3,1]']})
Resulting in this:
name str_date_range
0 NameOne_%Y-%m-%d [-2,1]
1 NameTwo_%y%m%d [-3,1]
To achieve your goal and avoid iterating rows - which should be avoided using pandas - you can use groupby().apply() like so:
def expand(row):
# Get the start_date and end_date from the row, by splitting
# the string and taking the first and last value respectively.
# .min() is required because row is technically a pd.Series
start_date = row.str_date_range.str.strip('[]').str.split(',').str[0].astype(int).min()
end_date = row.str_date_range.str.strip('[]').str.split(',').str[1].astype(int).min()
# Create a list range for from start_date to end_date.
# Note that range() does not include the end_date, therefor add 1
day_range = range(start_date, end_date+1)
# Create a Timedelta series from the day_range
days_diff = pd.to_timedelta(pd.Series(day_range), unit='days')
# Create an equally sized Series of today Timestamps
todays = pd.Series(pd.Timestamp.today()).repeat(len(day_range)-1).reset_index(drop=True)
df = todays.to_frame(name='date')
# Add days_diff to date column
df['date'] = df.date + days_diff
df['name'] = row.name
# Extract the date format from the name
date_format = row.name.split('_')[1]
# Add a column with the formatted date using the date_format string
df['date_str'] = df.date.dt.strftime(date_format=date_format)
df['name'] = df.name.str.split('_').str[0] + '_' + df.date_str
# Optional: drop columns
return df.drop(columns=['date'])
config_df.groupby('name').apply(expand).reset_index(drop=True)
returning:
name date_str
0 NameOne_2021-03-07 2021-03-07
1 NameOne_2021-03-08 2021-03-08
2 NameOne_2021-03-09 2021-03-09
3 NameTwo_210306 210306
4 NameTwo_210307 210307
5 NameTwo_210308 210308
6 NameTwo_210309 210309

Select nearest date first day of month in a python dataframe

i have this kind of dataframe
These data represents the value of an consumption index generally encoded once a month (at the end or at the beginning of the following month) but sometimes more. This value can be resetted to "0" if the counter is out and be replaced. Moreover some month no data is available.
I would like select only one entry per month but this entry has to be the nearest to the first day of the month AND inferior to the 15th day of the month (because if the day is higher it could be the measure of the end of the month). Another condition is that if the difference between two values is negative (the counter has been replaced), the value need to be kept even if the date is not the nearest day near the first day of month.
For example, the output data need to be
The purpose is to calculate only a consumption per month.
A solution is to parse the dataframe (as a array) and perform some if conditions statements. However i wonder if there is "simple" alternative to achieve that.
Thank you
You can normalize the month data with MonthEnd and then drop duplicates based off that column and keep the last value.
from pandas.tseries.offsets import MonthEnd
df.New = df.Index + MonthEnd(1)
df.Diff = abs((df.Index - df.New).dt.days)
df = df.sort_values(df.New, df.Diff)
df = df.drop_duplicates(subset='New', keep='first').drop(['New','Diff'], axis=1)
That should do the trick, but I was not able to test, so please copy and past the sample data into StackOverFlow if this isn't doing the job.
Defining dataframe, converting index to datetime, defining helper columns,
using them to run shift method to conditionally remove rows, and finally removing the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([
[1254],
[1265],
[1277],
[1301],
[1345],
[1541]
], columns=["Value"]
, index=[dt.strptime("05-10-19", '%d-%m-%y'),
dt.strptime("29-10-19", '%d-%m-%y'),
dt.strptime("30-10-19", '%d-%m-%y'),
dt.strptime("04-11-19", '%d-%m-%y'),
dt.strptime("30-11-19", '%d-%m-%y'),
dt.strptime("03-02-20", '%d-%m-%y')
]
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541

How to obtain values from a date column at particular time instance?

I have a df as follows:
Date values
20190101000000 1384.4801224435887
20190101000001 1384.5053056232982
20190101000002 1384.5304889818935
20190101000003 1384.5556725193492
20190101000004 1384.5808562356392
20190101000005 1384.606040130739
20190101000006 1384.631224204622
20190101000007 1384.6564084572635
20190101000008 1384.6815928886372
20190101000009 1384.7067774987179
20190101000010 1384.7319622874802
20190101000011 1384.757147254898
20190101000012 1384.7823324009464
20190101000013 1384.8075177255998
20190101000014 1384.8327032288325
20190101000015 1384.8578889106184
20190101000016 1384.8830747709321
20190101000017 1384.9082608097488
20190101000018 1384.9334470270423
20190101000019 1384.958633422787
20190101000020 1384.9838199969574
20190101000021 1385.0090067495285
20190101000022 1385.034193680474
20190101000023 1385.0593807897685
20190101000024 1385.0845680773864
20190101000025 1385.1097555433028
20190101000026 1385.134943187491
20190101000027 1385.160131009926
20190101000028 1385.1853190105826
20190101000029 1385.2105071894343
20190101000030 1385.2356955464566
where the Date column is of the format %Y%m%d%H%M%S. I take start date and end date as the user inputs and split it in a frequency of 1 second.
Now, I would like to take a second value of frequency from the user and obtain the value from the values column at that instant.
Example:
If the second resolution is 10secs, then the output must be as follows:
start end value
20190101000000 20190101000010 1384.7319622874802
20190101000011 20190101000020 1384.9838199969574
20190101000021 20190101000030 1385.2356955464566
from the above df, we can see that if the resolution is 10sec, then the value at every 10th second must be obtained.
If the second resolution is 15mins, then the output must be as follows:
start end values
20190101000000 20190101001500 1407.2142300429964
20190101001501 20190101003000 1416.6996533329484
20190101003001 20190101004500 1424.2467631293005
How can this be done?
My code till now:
import datetime
import pandas as pd
START_DATE = str(input('Enter start date in %Y-%m-%d %H:%M:%S format: '))
END_DATE = str(input('Enter end date in %Y-%m-%d %H:%M:%S format: '))
RESOLUTION = 'S'
dates = pd.date_range(START_DATE, END_DATE, freq = RESOLUTION)
dates = pd.DataFrame(pd.Series(dates).dt.strftime('%Y%m%d%H%M%S'), columns = ['Date'])
Compare values of datetimes converted to underline format with modulo by timedelta, then crete new column by DataFrame.insert and Series.shift, last remove first row with iloc:
res = '10s'
m = pd.to_datetime(df['Date']).to_numpy().astype(np.int64) % pd.Timedelta(res).value == 0
df = df[m].rename(columns={'Date':'end'})
df.insert(0, 'start', df['end'].shift())
df = df.iloc[1:]
print (df)
start end values
10 20190101000000 20190101000010 1384.7319622874802
20 20190101000010 20190101000020 1384.9838199969574
30 20190101000020 20190101000030 1385.2356955464566
Last for add 1 second use:
df.loc[df.index[1:], 'start'] = (pd.to_datetime(df.loc[df.index[1:], 'start']) +
pd.Timedelta('1s')).dt.strftime('%Y%m%d%H%M%S')
print (df)
start end values
10 20190101000000 20190101000010 1384.7319622874802
20 20190101000011 20190101000020 1384.9838199969574
30 20190101000021 20190101000030 1385.2356955464566
you have to change data type of dates ==>
import pandas as pd
start_date = pd.to_datetime(START_DATE)
end_date = pd.to_datetime(END_DATE)
Resolution = start_date.minute

difference between 2 dates in days saved into new column as float

I have a python dataframe with 2 columns that contain dates as strings e.g. start_date "2002-06-12" and end_date "2009-03-01". I would like to calculate the difference (days) between these 2 columns for each row and save the results into a new column called for example time_diff of type float.
I have tried:
df["time_diff"] = (pd.Timestamp(df.end_date) - pd.Timestamp(df.start_date )).astype("timedelta64[d]")
pd.to_numeric(df["time_diff"])
based on some tutorials but this gives TypeError: Cannot convert input for the first line. What do I need to change to get this running?
Here is a working example of converting a string column of a dataframe to datetime type and saving the time difference between the datetime columns in a new column as a float data type( number of seconds)
import pandas as pd
from datetime import timedelta
tmp = [("2002-06-12","2009-03-01"),("2016-04-28","2022-03-14")]
df = pd.DataFrame(tmp,columns=["col1","col2"])
df["col1"]=pd.to_datetime(df["col1"])
df["col2"]=pd.to_datetime(df["col2"])
df["time_diff"]=df["col2"]-df["col1"]
df["time_diff"]=df["time_diff"].apply(timedelta.total_seconds)
Time difference in seconds can be converted to minutes or days by using simple math.
Try:
import numpy as np
enddates = np.asarray([pd.Timestamp(end) for end in df.end_date.values])
startdates = np.asarray([pd.Timestamp(start) for start in df.start_date.values])
df['time_diff'] = (enddates - startdates).astype("timedelta64")
First convert strings to datetime, then calculate difference in days.
df['start_date'] = pd.to_datetime(df['start_date'], format='%Y-%m-%d')
df['end_date'] = pd.to_datetime(df['end_date'], format='%Y-%m-%d')
df['time_diff'] = (df.end_date - df.start_date).dt.days
You can also do it by converting your columns into date and then computing the difference :
from datetime import datetime
df = pd.DataFrame({'Start Date' : ['2002-06-12', '2002-06-12' ], 'End date' : ['2009-03-01', '2009-03-06']})
df['Start Date'] = [ datetime.strptime(x, "%Y-%m-%d").date() for x in df['Start Date'] ]
df['End date'] = [ datetime.strptime(x, "%Y-%m-%d").date() for x in df['End date'] ]
df['Diff'] = df['End date'] - df['Start Date']
Out :
End date Start Date Diff
0 2009-03-01 2002-06-12 2454 days
1 2009-03-06 2002-06-12 2459 days
You should just use pd.to_datetime to convert your string values:
df["time_diff"] = (pd.to_datetime(df.end_date) - pd.to_datetime(df.start_date))
The resul will automatically be a timedelta64
You can try this :
df = pd.DataFrame()
df['Arrived'] = [pd.Timestamp('01-04-2017')]
df['Left'] = [pd.Timestamp('01-06-2017')]
diff = df['Left'] - df['Arrived']
days = pd.Series(delta.days for delta in (diff)
result = days[0]

Categories

Resources