The problem is that I can't think of a way to get the 'Mark' from the last day of the previous month, which I need in order to compare the current month with the previous month.
'Mark' is a number generated every day. 'Mark_LastDayDate' takes the mark of the last day of the month as a reference and repeats it for every row of that same month. 'Mark_LastDayDate_PreviousMonth' would be the equivalent of 'Mark_LastDayDate' for the previous month, so that I can compare the two later.
I have the following DF
import pandas as pd
from pandas.tseries.offsets import BMonthEnd
import datetime as dt
df = pd.DataFrame({'Found':['A','A','A','A','A','B','B','B'],
'Date':['14/10/2021','19/10/2021','29/10/2021','30/09/2021','20/09/2021','20/10/2021','29/10/2021','15/10/2021'],
#'LastDayMonth':['29/10/2021','29/10/2021','29/10/2021','30/09/2021','30/09/2021','29/10/2021','29/10/2021','29/10/2021'],
'Mark':[1,2,3,4,3,1,2,3]
})
print(df)
LastDayMonth (commented out above) is computed by the code below.
First I converted the date column; the strings are day-first, so the format is given explicitly:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['LastDayDate'] = pd.to_datetime(df['Date']) + BMonthEnd(0)
df['LastDayDatePrevMonth'] = pd.to_datetime(df['Date']) - pd.DateOffset(months=1)
I needed the 'Mark' of the last day of the month for each date, so I used a merge:
df = df.merge(df.loc[df['Date'] == df['LastDayDate'], ['Found','LastDayDate','Mark']],
on=['Found', 'LastDayDate'],
how='left', suffixes=['', '_LastDayDate'])
How can I do the same to get the 'Mark' from the last day of the previous month, as a column of the same DataFrame?
Sample df that I filled in manually
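One possible way to get the previous month's value is to reuse the same merge, but keyed on the previous business month end. This is only a sketch: it assumes 'LastDayDatePrevMonth' should hold the last business day of the previous month (computed here as LastDayDate - BMonthEnd(1), not as Date minus one month), and the helper name month_end is made up for illustration. Rows with no month-end record in the previous month get NaN.

```python
import pandas as pd
from pandas.tseries.offsets import BMonthEnd

df = pd.DataFrame({
    'Found': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Date': ['14/10/2021', '19/10/2021', '29/10/2021', '30/09/2021',
             '20/09/2021', '20/10/2021', '29/10/2021', '15/10/2021'],
    'Mark': [1, 2, 3, 4, 3, 1, 2, 3],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# last business day of the row's month, and of the month before it
df['LastDayDate'] = df['Date'] + BMonthEnd(0)
df['LastDayDatePrevMonth'] = df['LastDayDate'] - BMonthEnd(1)

# hypothetical helper: one row per (Found, month end) holding that day's Mark
month_end = df.loc[df['Date'] == df['LastDayDate'], ['Found', 'LastDayDate', 'Mark']]

# Mark of the last day of the current month
df = df.merge(month_end.rename(columns={'Mark': 'Mark_LastDayDate'}),
              on=['Found', 'LastDayDate'], how='left')

# Mark of the last day of the previous month: same lookup, different key
df = df.merge(month_end.rename(columns={'LastDayDate': 'LastDayDatePrevMonth',
                                        'Mark': 'Mark_LastDayDate_PreviousMonth'}),
              on=['Found', 'LastDayDatePrevMonth'], how='left')
print(df)
```

With this sample, only the October rows of 'A' find a previous month-end record (30/09/2021), so the other rows stay NaN.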
I had a column in my DataFrame called startEndDate, for example '10.12-20.05.2019', and split it into two columns, start_date and end_date, with the same year: start_date '10.12.2019' and end_date '20.05.2019'. But the year in this example is wrong; it should be 2018, because the start date cannot be after the end date. How can I check the entire DataFrame and replace values so that it contains the correct start dates, based on an if statement (because some start dates should keep 2019 as the year)?
This will show you the rows in which the start date is later than the end date:
import numpy as np
import pandas as pd

data = {
    'Start_Date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
    'End_Date': ['2020-02-01', '2019-01-02', '2019-01-03', '2020-01-05']
}
df = pd.DataFrame(data)
# ISO-formatted strings parse without extra arguments
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df['End_Date'] = pd.to_datetime(df['End_Date'])
# flag rows whose start date falls after the end date
df['Check'] = np.where(df['Start_Date'] > df['End_Date'], 'Error', 'No Error')
df
Without seeing more of your data or your intended final data this is the best we will be able to do to help identify problems in the data.
This method first splits up the date string to two dates and creates start and end date columns. Then it subtracts 1 year from the start date if it is greater than the end date.
import pandas as pd
import numpy as np
# mock data
df = pd.DataFrame({"dates": ["10.12-20.05.2019", "02.04-31.10.2019"]})
# split date string to two dates, convert to datetime and stack to columns
df[["start", "end"]] = np.vstack(
df.dates.apply(lambda x: pd.to_datetime(
[x.split("-")[0] + x[-5:],
x.split("-")[1]], format="%d.%m.%Y")))
# subtract 1 year from start date if greater than end date
df["start"] = np.where(df["start"]>df["end"],
df["start"] - pd.DateOffset(years=1),
df["start"])
df
# dates start end
#0 10.12-20.05.2019 2018-12-10 2019-05-20
#1 02.04-31.10.2019 2019-04-02 2019-10-31
Although I have used split here for the initial splitting of the string, there will always be 5 characters before the hyphen, and the year will always be the last 5 characters (including the dot). So there is no need to use split, and that line could instead change to:
df[["start", "end"]] = np.vstack(
df.dates.apply(lambda x: pd.to_datetime(
[x[:5] + x[-5:],
x[6:]], format="%d.%m.%Y")))
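The same idea can also be written without apply, using pandas string methods. The following is a minimal sketch on the same mock data, assuming every value has the 'dd.mm-dd.mm.YYYY' shape; the intermediate names parts and year are illustrative only.

```python
import pandas as pd

# mock data, same as above
df = pd.DataFrame({"dates": ["10.12-20.05.2019", "02.04-31.10.2019"]})

# split on the hyphen; the year (with its leading dot) is the last 5 characters
parts = df["dates"].str.split("-", expand=True)
year = df["dates"].str[-5:]

df["start"] = pd.to_datetime(parts[0] + year, format="%d.%m.%Y")
df["end"] = pd.to_datetime(parts[1], format="%d.%m.%Y")

# keep the start date where it is not after the end date, otherwise go back one year
df["start"] = df["start"].where(df["start"] <= df["end"],
                                df["start"] - pd.DateOffset(years=1))
print(df)
```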
My DataFrame looks something like this:
ID   Days of Holiday  First Day of Holiday
A01  3                16/03/2021
B01  10               24/03/2021
C02  3                31/03/2021
D03  2                02/04/2021
I am trying to figure out a way to create another column "First Day of Return from holiday".
I tried to loop through the DF using iterrows like below (the DF above is "Calendar"):
for i, r in Calendar.iterrows():\
Calendar["First Day of Return from holiday"] = Calendar["First Day of Holiday"] + pd.tseries.offsets.BDay(n = r["Days of Holiday"])
But I don't get the correct output with the code above.
Is there another way you can recommend?
Basically, I am looking for ways to add or subtract an integer column to or from a datetime column of the same row, in business days.
Thanks a ton!
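You can use a lambda function with apply (this assumes 'First Day of Holiday' has already been converted to datetime and that BDay has been imported from pandas.tseries.offsets):
df['First Day of Return from holiday'] = df.apply(lambda row: row['First Day of Holiday'] + BDay(row['Days of Holiday']), axis=1)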
First we convert the 'First Day of Holiday' column to the datetime datatype and 'Days of Holiday' to the int datatype, then we initialise the new column. We can then iterate over the rows and assign a value to each row of the new column, using BDay to count in business days.
The dates end up in pandas' standard datetime format; they can be converted back to the required format using .dt.strftime('%d/%m/%Y').
import pandas as pd
from pandas.tseries.offsets import BDay
# parse the day-first strings explicitly and make sure the day counts are integers
df['First Day of Holiday'] = pd.to_datetime(df['First Day of Holiday'], format='%d/%m/%Y')
df['Days of Holiday'] = df['Days of Holiday'].astype('int')
# initialise the new column as datetime (NaT) rather than 0 so its dtype stays consistent
df['First Day of Return from holiday'] = pd.NaT
for i, r in df.iterrows():
    df.loc[i, 'First Day of Return from holiday'] = r['First Day of Holiday'] + BDay(n=r['Days of Holiday'])
# convert both date columns back to dd/mm/YYYY strings
df['First Day of Holiday'] = df['First Day of Holiday'].dt.strftime('%d/%m/%Y')
df['First Day of Return from holiday'] = pd.to_datetime(df['First Day of Return from holiday'])
df['First Day of Return from holiday'] = df['First Day of Return from holiday'].dt.strftime('%d/%m/%Y')
df
Output
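As a loop-free alternative, NumPy's busday_offset accepts an array of dates together with an array of offsets. The sketch below rebuilds the sample table from the question and assumes a plain Monday-to-Friday week with no public holidays. Note that roll="forward" first moves a weekend start date to the next Monday, which is not exactly what BDay does for weekend dates; all sample dates here are weekdays.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["A01", "B01", "C02", "D03"],
    "Days of Holiday": [3, 10, 3, 2],
    "First Day of Holiday": ["16/03/2021", "24/03/2021", "31/03/2021", "02/04/2021"],
})

# parse the day-first strings and drop to day resolution for NumPy
start = pd.to_datetime(df["First Day of Holiday"], format="%d/%m/%Y")
start_days = start.to_numpy().astype("datetime64[D]")

# add each row's own number of business days in one vectorized call
returned = np.busday_offset(start_days, df["Days of Holiday"].to_numpy(), roll="forward")

# back to dd/mm/YYYY strings, as in the original table
df["First Day of Return from holiday"] = pd.DatetimeIndex(returned).strftime("%d/%m/%Y").tolist()
print(df)
```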
I have this kind of DataFrame.
These data represent the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following one) but sometimes more often. The value can be reset to "0" if the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, but this entry has to be the one nearest to the first day of the month AND before the 15th day of the month (because a later day could be the end-of-month measurement). Another condition is that if the difference between two consecutive values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data need to be
The purpose is to calculate a single consumption value per month.
One solution is to iterate over the DataFrame (as an array) and apply some if statements. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can normalize the dates to the month end with MonthEnd, then drop duplicates on that column, keeping the entry closest to the month end (the last value of each month).
from pandas.tseries.offsets import MonthEnd
# assumes the dates live in a datetime column named 'Index'
df['New'] = df['Index'] + MonthEnd(1)
df['Diff'] = (df['Index'] - df['New']).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow if this isn't doing the job.
Defining dataframe, converting index to datetime, defining helper columns,
using them to run shift method to conditionally remove rows, and finally removing the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([
[1254],
[1265],
[1277],
[1301],
[1345],
[1541]
], columns=["Value"]
, index=[dt.strptime("05-10-19", '%d-%m-%y'),
dt.strptime("29-10-19", '%d-%m-%y'),
dt.strptime("30-10-19", '%d-%m-%y'),
dt.strptime("04-11-19", '%d-%m-%y'),
dt.strptime("30-11-19", '%d-%m-%y'),
dt.strptime("03-02-20", '%d-%m-%y')
]
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
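The rule stated in the question (first reading before the 15th of each month, plus any reading where the counter went down) could also be sketched with a groupby on the month period. This is one reading of the requirements, not a tested solution for the original data: it reuses the sample values above and adds one made-up row (2020-02-20, value 12) to show the counter-reset case.

```python
import pandas as pd

df = pd.DataFrame(
    {"Value": [1254, 1265, 1277, 1301, 1345, 1541, 12]},
    index=pd.to_datetime([
        "2019-10-05", "2019-10-29", "2019-10-30", "2019-11-04",
        "2019-11-30", "2020-02-03",
        "2020-02-20",  # hypothetical row: counter replaced, value restarts low
    ]),
).sort_index()

# readings taken before the 15th, then the earliest one in each calendar month
early = df[df.index.day < 15]
first_idx = early.groupby(early.index.to_period("M")).head(1).index

# a negative difference to the previous reading marks a replaced counter
is_reset = df["Value"].diff() < 0

# keep the first early reading of each month and every reset reading
result = df[is_reset | df.index.isin(first_idx)]
print(result)
```

Months with no reading before the 15th simply produce no row here, which may or may not be what is wanted.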
I have a DataFrame with a date column and a month_diff column. I would like to get a new date (call it Target_Date) based on the following logic:
For example, if the date is 2/13/2019 and month_diff is 3, then the target date should be the month end of the original date plus 3 months, which is 5/31/2019.
I tried the following method to get the target date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed; as far as I know, the parameter of DateOffset has to be a single variable or a fixed number, not a column.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It fails too.
Can anyone help? Thank you.
edit:
This is a large dataset with millions of rows.
You could try this
import pandas as pd
from dateutil.relativedelta import relativedelta
# pd.datetime is no longer available in recent pandas; pd.Timestamp is used instead
df = pd.DataFrame({'Date': [pd.Timestamp(2019, 1, 1), pd.Timestamp(2019, 2, 1)], 'month_diff': [1, 2]})
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or list comprehension
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]
I would compute your "target_date" in the following steps:
1. Apply the target month offset (in your case +3 months), using your pd.DateOffset.
2. Get the last day of that target month (using, for example, calendar.monthrange; see also "Get last day of the month"). This gives you the "flexible" part of the date offset.
3. Apply the flexible day offset, obtained by comparing the results of steps 1 and 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta
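for ii in df.index:
    # shift the date by the row's month_diff ('Date' is the column name from the question)
    new_ = df.at[ii, 'Date'] + relativedelta(months=df.at[ii, 'month_diff'])
    # number of days in the target month
    max_date = calendar.monthrange(new_.year, new_.month)[1]
    # move forward to the last day of that month
    end_ = new_ + relativedelta(days=max_date - new_.day)
    print(end_)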
Further "cleaning" into a function and / or list comprehension will probably make it much faster
import pandas as pd
from datetime import datetime
from datetime import timedelta
This is my approach to solving your issue.
However, for some reason the output is logically wrong even though I believe the approach is correct. Please correct me if you notice something wrong.
today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff = [30, 5, 7]
n = 30  # a month is approximated as 30 days
for i in month_diff:
    b = {'Date': today, 'month_diff': month_diff, "Target_Date": datetime.now() + timedelta(days=i*n)}
df = pd.DataFrame(data=b)
Output:
The value of i appears not to be updated: every row ends up with the same Target_Date. This is because b is rebuilt on each pass of the loop, so only the last i (7) survives; each row would need its own offset instead.
I was looking for a solution I can write in one line only and apply does the job. However, by default apply function performs action on each column, so you have to remember to specify correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with date adjusted by number of months from 'month_diff' column and later adjust to the last day of month
df['Target_Date'] = df.apply(lambda row: row.Date # to current date
+ relativedelta(months=row.month_diff) # add month_diff
+ relativedelta(day=+31) # and adjust to the last day of month
, axis=1) # 1 or ‘columns’: apply function to each row.
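Since the question mentions millions of rows, a vectorized variant may be worth trying, because apply with axis=1 calls Python code once per row. The sketch below is one possible approach using NumPy month arithmetic, not a benchmarked solution; the sample rows are made up, with the first one taken from the example in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-02-13", "2019-01-31"]),
    "month_diff": [3, 1],
})

# truncate each date to its month, then add the per-row number of months
months = df["Date"].to_numpy().astype("datetime64[M]")
shifted = months + df["month_diff"].to_numpy().astype("timedelta64[M]")

# first day of the following month minus one day = last day of the target month
month_end = (shifted + np.timedelta64(1, "M")).astype("datetime64[D]") - np.timedelta64(1, "D")

df["Target_Date"] = pd.to_datetime(month_end)
print(df)
```

For 2/13/2019 with a month_diff of 3 this yields 5/31/2019, matching the example given in the question.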
I'm a beginner in Python. I'm working on a dataset which contains data for several years. This is a sample of the dataset.
Here, Hour(LT) means time and DN(LT) means Day number of the year.
I have been working on this dataset with pandas in Python 3.0 (Anaconda). My final goal is to find the daily, weekly, monthly and yearly means, so I preferred to convert the dates into a pandas.DatetimeIndex (resampling should then be quite easy, I guess!).
Here is the code I have written so far.
import pandas as pd
import numpy as np
# sep and delimiter are aliases, so only one of them should be passed
df = pd.read_csv('test_file.txt', sep=' ')
#convert the year, month, day int columns into datetime format
year_month = pd.to_datetime(10000 * df.Year +100 * df.Month +df.Day, format='%Y%m%d')
#convert Year, Month, Day, Hour(LT) into DayTimeHour format
year_hour_convert = pd.DataFrame({
'Day': np.array(year_month, dtype=np.datetime64),
'Hour': np.array(df['Hour(LT)'], dtype=np.int64)
})
#merge into "year-month-day-hour" format
year_hour = pd.to_datetime(year_hour_convert.Day) + pd.to_timedelta(year_hour_convert.Hour, unit='h')
#Define a new column for Time Series
df['DateTime'] = year_hour
#Drop unnecessary columns
df = df.drop(['Year', 'Month', 'Day', 'Hour(LT)', 'DN(LT)'], axis=1)
#Set YYYYMMDD HHMMSS as index
df = df.set_index('DateTime')
#Choose the data for 9 a.m. to 3 p.m.
df = df.between_time('09:00:00', '15:00:00')
I have turned my dataset into this format, eventually dropping the 'Year', 'Month', 'Day', 'Hour(LT)' and 'DN(LT)' columns. I provided a picture of that format.
Now I want to filter the data depending on whether enough data points are available for a given day. For instance, if the number of data points for 4th Jan, 2016 (2016-01-04) is above 4, I will keep the data of that day. Otherwise, I will drop the data of that day.
How can I do it in Pandas?
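One way this kind of filter could be written is with groupby(...).filter on the calendar day of the index. The sketch below uses a small made-up frame with a DatetimeIndex named 'DateTime' and a hypothetical 'value' column, since the real columns are not shown; the threshold of 4 comes from the question.

```python
import pandas as pd

# made-up hourly data: 6 readings on 2016-01-04 and only 3 on 2016-01-05
idx = pd.to_datetime(
    [f"2016-01-04 {h:02d}:00" for h in range(9, 15)]
    + [f"2016-01-05 {h:02d}:00" for h in range(9, 12)]
)
df = pd.DataFrame({"value": range(len(idx))}, index=idx)
df.index.name = "DateTime"

# group the rows by calendar day and keep only days with more than 4 rows
filtered = df.groupby(df.index.normalize()).filter(lambda day: len(day) > 4)
print(filtered)
```

Here len(day) counts rows; if missing values should not count as available data, a count of non-null values in the relevant column could be compared against the threshold instead.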