I have a data frame with time series data: one column holds signup dates and another holds cancel dates. For the missing cancel dates I want to fill in a date that is no later than a specific date, but at most 40 weeks out.
How should I proceed?
If df['cancel_date'] is NaT, fill it with the max date + 40 weeks.
df['cancel_date'] - df['signup_date'] should never be negative.
IIUC, you can use Series.fillna with the pandas.Timedelta class.
If adding 40 weeks to each record's signup_date:
df['cancel_date'] = df['cancel_date'].fillna(df['signup_date'] + pd.Timedelta(40, 'W'))
If adding 40 weeks to the maximum date in the signup_date column:
df['cancel_date'] = df['cancel_date'].fillna(df['signup_date'].max() + pd.Timedelta(40, 'W'))
Or if using some predefined max date value, with the constraint that signup_date < cancel_date, chain on the clip method:
max_date = pd.Timestamp(2018, 4, 30)
df['cancel_date'] = df['cancel_date'].fillna(max_date + pd.Timedelta(40, 'W')).clip(lower=df.signup_date)
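To make that concrete, a minimal runnable sketch (the column names and dates here are made up):
import pandas as pd
# hypothetical data: the second row has no cancel date
df = pd.DataFrame({
    'signup_date': pd.to_datetime(['2018-01-01', '2018-03-15']),
    'cancel_date': pd.to_datetime(['2018-02-01', pd.NaT]),
})
max_date = pd.Timestamp(2018, 4, 30)
# fill missing cancel dates, then ensure no cancel date precedes its signup date
df['cancel_date'] = (df['cancel_date']
                     .fillna(max_date + pd.Timedelta(40, 'W'))
                     .clip(lower=df['signup_date']))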
I would use numpy.where if you want to add a column with the difference between signup date and cancel date directly (note that comparing against np.nan with == is always False, so isna() is the correct missing-value check):
df['date difference between signup and cancel'] = np.where(
    df['cancel_date'].isna(),
    (df['signup_date'].max() + pd.Timedelta(40, 'W')) - df['signup_date'],
    df['cancel_date'] - df['signup_date'])
This will give you a new column containing the difference between the signup date and the cancel date.
I have uploaded a big file and created a DataFrame from it.
Now I want to update some of the columns containing timestamps and, if possible, also update the date column based on them.
The reason is that I want to adjust for daylight saving time: the list I am working with is in GMT, so I need to adjust its timestamps.
Example that works:
df_winter2['Confirmation_Time'] = pd.to_datetime(df_winter2['Confirmation_Time'].astype(str)) + pd.DateOffset(hours=7)
df_summer['Confirmation_Time'] = pd.to_datetime(df_summer['Confirmation_Time'].astype(str)) + pd.DateOffset(hours=6)
I want to write a function that first adds the 6 or 7 hours to the DataFrame, depending on whether it is summertime or wintertime.
If possible, I also want to update the date column with +1 day when the timestamp is > 16:00;
the date column is called df['Creation_Date'].
This function should work for checking whether it is wintertime:
def wintertime(date_time):
    year, month, day = date_time.timetuple()[0:3]
    return (month < 3) or (month == 12 and day < 21)
Now, I am guessing you also want to loop through your df and update the times accordingly, which you could do with the following:
for i in df.index:
    date_time = df.loc[i, 'Confirmation_Time']
    if wintertime(date_time):
        df.loc[i, 'Confirmation_Time'] = pd.to_datetime(str(date_time)) + pd.DateOffset(hours=7)
    else:
        df.loc[i, 'Confirmation_Time'] = pd.to_datetime(str(date_time)) + pd.DateOffset(hours=6)
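If you prefer to avoid the row loop, a vectorized sketch (assuming Confirmation_Time parses cleanly to datetime and reusing the winter rule above) could look like this; it also sketches the +1 day on Creation_Date for times after 16:00 that the question asks about:
import numpy as np
import pandas as pd
ts = pd.to_datetime(df['Confirmation_Time'].astype(str))
winter = (ts.dt.month < 3) | ((ts.dt.month == 12) & (ts.dt.day < 21))
# +7 hours in winter, +6 in summer
df['Confirmation_Time'] = np.where(winter,
                                   ts + pd.DateOffset(hours=7),
                                   ts + pd.DateOffset(hours=6))
# bump the date by one day when the adjusted time is 16:00 or later
late = pd.to_datetime(df['Confirmation_Time']).dt.hour >= 16
df.loc[late, 'Creation_Date'] = (pd.to_datetime(df.loc[late, 'Creation_Date'])
                                 + pd.Timedelta(days=1))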
I'm trying to group an xarray.Dataset object into a custom 5-month period spanning from October-January with an annual frequency. This is complicated because the period crosses New Year.
I've been trying to use the approach
wb_start = temperature.sel(time=temperature.time.dt.month.isin([10,11,12,1]))
wb_start1 = wb_start.groupby('time.year')
But this predictably groups January with the other months of the same calendar year, instead of with the preceding October-December (+1 year). Any help would be appreciated!
I fixed this in a somewhat clunky albeit effective way by adding a year to the October-December timestamps. My method essentially moves months 10, 11, and 12 up one year while leaving the January data in place, and then does a groupby(year) on the reindexed time data.
import pandas as pd
from dateutil.relativedelta import relativedelta
wb_start = temperature.sel(time=temperature.time.dt.month.isin([10, 11, 12, 1]))
# convert cftime index to a regular DatetimeIndex
datetimeindex = wb_start.indexes['time'].to_datetimeindex()
wb_start['time'] = pd.to_datetime(datetimeindex)
# convert the time values to pd.Timestamp objects
time1 = [pd.Timestamp(i) for i in wb_start['time'].values]
# add a year to the Oct-Dec timestamps (relativedelta does not work on np.datetime64)
time2 = [i + relativedelta(years=1) if i.month >= 10 else i for i in time1]
wb_start['time'] = time2
# group by the new time index
wb_start1 = wb_start.groupby('time.year')
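As an aside, the list comprehension can be replaced by a vectorized variant (a sketch operating on the same datetimeindex as above):
import numpy as np
import pandas as pd
# shift the Oct-Dec timestamps forward one year in one vectorized step
shifted = datetimeindex + pd.DateOffset(years=1)
wb_start['time'] = np.where(datetimeindex.month >= 10, shifted, datetimeindex)
wb_start1 = wb_start.groupby('time.year')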
I have a DataFrame with 96 records each day, for 5 consecutive days.
Data: {'value': {Timestamp('2018-05-03 00:07:30'): 13.02657778, Timestamp('2018-05-03 00:22:30'): 10.89890556, Timestamp('2018-05-03 00:37:30'): 11.04877222, ...}} (more days and times)
Datatypes: DatetimeIndex (index column) and float64 ('value' column).
I want to keep the 10 records before an indicated hour (H) of each day.
I only managed to do that for one day:
df.loc[df['time'] < '09:07:30'].tail(10)
You can group your data by day (or by month or by other ranges) using pandas.Grouper (see also this discussion).
In your case, use something like:
df.groupby(pd.Grouper(freq='D')).tail(10)
EDIT:
For getting all rows before a given hour, use df.loc[df.index.hour < H] (as already proposed in simpleApp's answer) where H is the hour as an integer value.
So in one line:
df.loc[df.index.hour < H].groupby(pd.Grouper(freq='D')).tail(10)
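For example, on made-up data matching the question's shape (96 fifteen-minute records per day, with H = 9):
import pandas as pd
idx = pd.date_range('2018-05-03 00:07:30', periods=96 * 5, freq='15min')
df = pd.DataFrame({'value': range(len(idx))}, index=idx)
H = 9  # keep the last 10 records before 09:00 of each day
result = df.loc[df.index.hour < H].groupby(pd.Grouper(freq='D')).tail(10)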
I would suggest filtering the records by hour and then grouping by date.
Data setup:
import pandas as pd
start, end = '2020-10-01 01:00:00', '2021-04-30 23:30:00'
rng = pd.date_range(start, end, freq='5min')
df = pd.DataFrame(rng, columns=['DateTS'])
Set the hour:
noon_hour = 12  # the hour to filter on
Result (if head or tail does not behave as expected on your data, you may need to sort it first):
df_before_noon=df.loc[df['DateTS'].dt.hour < noon_hour] # records before noon
df_result=df_before_noon.groupby([df_before_noon['DateTS'].dt.date]).tail(10) # group by date
I have this kind of dataframe:
These data represent the value of a consumption index, generally encoded once a month (at the end or at the beginning of the following month) but sometimes more often. The value can be reset to 0 if the counter breaks and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month; this entry has to be the nearest to the first day of the month AND earlier than the 15th day of the month (because if the day is later, it could be a measurement from the end of the month). Another condition is that if the difference between two values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data needs to be:
The purpose is to calculate a single consumption value per month.
A solution would be to iterate over the dataframe (as an array) and apply some if-condition statements. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can normalize the dates to the month end with MonthEnd, then sort and drop duplicates based on that column, keeping the last value.
from pandas.tseries.offsets import MonthEnd
df['New'] = df.index + MonthEnd(1)
df['Diff'] = (df['New'] - df.index).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='last').drop(['New', 'Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow as text if this isn't doing the job.
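As a quick sanity check, here is the same idea run on made-up data in the spirit of the question (assuming a DatetimeIndex):
import pandas as pd
from pandas.tseries.offsets import MonthEnd
df = pd.DataFrame({'Value': [1254, 1265, 1277, 1301]},
                  index=pd.to_datetime(['2019-10-05', '2019-10-29',
                                        '2019-10-30', '2019-11-04']))
df['New'] = df.index + MonthEnd(1)
df['Diff'] = (df['New'] - df.index).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='last').drop(['New', 'Diff'], axis=1)
print(df)  # the 2019-10-05 and 2019-11-04 rows remain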
Defining the dataframe, converting the index to datetime, defining helper columns, using them with the shift method to conditionally remove rows, and finally removing the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
from datetime import datetime as dt
import pandas as pd
import numpy as np
df = pd.DataFrame([[1254], [1265], [1277], [1301], [1345], [1541]],
                  columns=["Value"],
                  index=[dt.strptime("05-10-19", '%d-%m-%y'),
                         dt.strptime("29-10-19", '%d-%m-%y'),
                         dt.strptime("30-10-19", '%d-%m-%y'),
                         dt.strptime("04-11-19", '%d-%m-%y'),
                         dt.strptime("30-11-19", '%d-%m-%y'),
                         dt.strptime("03-02-20", '%d-%m-%y')])
# days to the nearest month boundary: the previous month end for early days,
# the next month begin for late days
early_days = df.loc[df.index.day < 15]
early_day_diff = pd.Series(early_days.index - (early_days.index - MonthEnd(1)),
                           index=early_days.index)
late_days = df.loc[df.index.day >= 15]
late_day_diff = pd.Series((late_days.index + MonthBegin(1)) - late_days.index,
                          index=late_days.index)
df["day_offset"] = (pd.concat([early_day_diff, late_day_diff])
                    / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.month
# keep a row only when the months of its neighbouring rows differ, so the first
# and last readings around each month boundary survive
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
Following some tutorials I am trying to filter my data by dates selected from a dropdown menu. I have set my date column as the index and tested that all the values are of type datetime but I am receiving the following error:
TypeError("'<' not supported between instances of 'str' and 'datetime.date'",)
Data:
CustomerName,OrderDate,Item,ItemSKU,Price,Quantity,Channel,Total
Joe Blog,26/09/2018,Rocks,Rock001,10.99,10,Amazon,100.99
Joe Blog,26/08/2018,Rocks,Rock001,10.99,10,Amazon,100.99
Joe Blog,26/07/2018,Rocks,Rock001,10.99,10,Amazon,100.99
Code:
The values year and month are returned from the user's selection.
import datetime
from calendar import monthrange
firstDayMonth = datetime.date(year, month, 1)
daysHolder = monthrange(year, month)
lastDayMonth = datetime.date(year, month, daysHolder[1])
df = pd.read_csv("C:/Users/User/Desktop/testData.csv")
gb = df.groupby(['Channel'])
Amz = gb.get_group('Amazon')
df = Amz.set_index(Amz['OrderDate'])
df['OrderDate'] = df['OrderDate'].astype('datetime64[ns]')
newData = df.loc[firstDayMonth:lastDayMonth]
So it seems I just need to switch the order of the dates in the slice: newData = df.loc[lastDayMonth:firstDayMonth] works, but newData = df.loc[firstDayMonth:lastDayMonth] doesn't. I think this is because my data is sorted in descending order, from the latest date to the oldest.
While you do set the index to the OrderDate column, you do so before you convert that column's type to datetime. You need to change the type before using the column as the index; otherwise your indexing with loc fails.
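Concretely, a corrected ordering might look like this (a sketch: dayfirst parsing is assumed from the sample dates, and year/month come from the user's selection as above):
import datetime
from calendar import monthrange
import pandas as pd
df = pd.read_csv("C:/Users/User/Desktop/testData.csv")
Amz = df.groupby(['Channel']).get_group('Amazon').copy()
# convert to datetime BEFORE using the column as the index
Amz['OrderDate'] = pd.to_datetime(Amz['OrderDate'], dayfirst=True)
Amz = Amz.set_index('OrderDate').sort_index()  # ascending, so a normal slice works
firstDayMonth = datetime.date(year, month, 1)
lastDayMonth = datetime.date(year, month, monthrange(year, month)[1])
newData = Amz.loc[pd.Timestamp(firstDayMonth):pd.Timestamp(lastDayMonth)]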