find nearest future date from columns - python

Please help with finding the next date from today for each row, across the four columns as shown below. I have been stuck on this for a while now.
InDate1 InDate2 InDate3 InDate4
284075 2018-03-07 2018-09-07 2019-03-07 2019-01-21
334627 2018-03-07 2018-09-07 2019-03-07 2019-09-07

Using lookup:
For each row, find the column that holds the closest future date:
import pandas as pd
s = (df.apply(pd.to_datetime)  # If not already datetime
       .apply(lambda x: (x - pd.to_datetime('today')).dt.total_seconds())
       .where(lambda x: x.gt(0))
       .idxmin(1))
print(s)
#284075 InDate3
#334627 InDate3
#dtype: object
Then lookup the values for each row:
df.lookup(s.index, s)
#array(['2019-03-07', '2019-03-07'], dtype=object)
To elaborate on what this does, you can look at what each part does separately.
First, determine the time difference between your DataFrame and today. .apply(pd.to_datetime) converts everything to datetime so arithmetic with the dates is possible, and the second apply finds the time difference, converting it from a timedelta to a number of seconds, which is just a float. (I'm not sure why a simple df - pd.to_datetime('today') doesn't quite work here and the apply is needed.)
s = (df.apply(pd.to_datetime)  # If not already datetime
       .apply(lambda x: (x - pd.to_datetime('today')).dt.total_seconds()))
print(s)
InDate1 InDate2 InDate3 InDate4
284075 -2.769565e+07 -1.179805e+07 3.840347e+06 -4.765262e+04
334627 -2.769565e+07 -1.179805e+07 3.840347e+06 1.973795e+07
Dates in the future will have a positive time difference, so I use .where to find only the cells that have positive values, replacing everything else with NaN
s = s.where(lambda x: x.gt(0))
# Could use s.where(s.gt(0)) here since `s` is defined
print(s)
InDate1 InDate2 InDate3 InDate4
284075 NaN NaN 3.840347e+06 NaN
334627 NaN NaN 3.840347e+06 1.973795e+07
Then .idxmin(axis=1) returns, for each row, the column holding the minimum value (ignoring NaN), which is the closest future date.
s = s.idxmin(axis=1)
print(s)
284075 InDate3
334627 InDate3
dtype: object
Finally, DataFrame.lookup retrieves the original date from that cell for each row, which is fairly self-explanatory.
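One caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, a minimal equivalent sketch (assuming s maps each row to a single column label, as above) is plain NumPy indexing on the values:
import numpy as np
rows = df.index.get_indexer(s.index)
cols = df.columns.get_indexer(s)
nearest = pd.Series(df.to_numpy()[rows, cols], index=s.index)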

Please check this approach.
First, stack the date values into rows so that we can compare against today and take a minimum.
df1 = df.stack().reset_index()
df1.columns = ["ID", "Field", "Date"]
Then filter to dates after today and find the minimum date per ID.
import datetime
df1 = df1[df1.Date > datetime.datetime.now()].groupby("ID").agg("min").reset_index()
Then pivot the result. Before pivoting, assign a static value to Field so the pivot produces a single column header (MinValue) instead of InDate1, InDate2, etc.
df1.Field = "MinValue"
df1 = df1.pivot(index="ID", columns="Field", values="Date").reset_index()
Finally, merge the minimum-date dataframe back into the original dataframe.
df = df.merge(df1, how="left")
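To make the whole pipeline concrete, here is a runnable sketch. The sample values are hypothetical (built relative to today so that some dates land in the future), and the merge needs the ID in a shared column, so the index is reset first:
import datetime
import pandas as pd

today = datetime.datetime.now()
# hypothetical sample: two IDs, four date columns, a mix of past and future dates
df = pd.DataFrame(
    {"InDate1": [today - pd.Timedelta(days=300)] * 2,
     "InDate2": [today - pd.Timedelta(days=100)] * 2,
     "InDate3": [today + pd.Timedelta(days=60)] * 2,
     "InDate4": [today - pd.Timedelta(days=10), today + pd.Timedelta(days=200)]},
    index=pd.Index([284075, 334627], name="ID"),
)

df1 = df.stack().reset_index()
df1.columns = ["ID", "Field", "Date"]
df1 = df1[df1.Date > today].groupby("ID").agg("min").reset_index()
df1["Field"] = "MinValue"
df1 = df1.pivot(index="ID", columns="Field", values="Date").reset_index()
out = df.reset_index().merge(df1, how="left")  # "ID" is the shared merge key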

Related

How to use a previous row value in a pandas dataframe when the previous value is also calculated with grouped data

I have this DataFrame:
import pandas as pd

df = pd.DataFrame({'id': [111, 111, 111, 222, 222, 222],
                   'Date': ['30.04.2020', '31.05.2020', '30.06.2020',
                            '30.04.2020', '31.05.2020', '30.06.2020'],
                   'Debt': [100, 100, 70, 200, 200, 200],
                   'Ear_coef': [0, 0.2, 0.2, 0, 0, 0.3]})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Contract'] = pd.DataFrame(df.groupby(['id']).apply(lambda x: x.Debt - x.Debt.shift(1))).reset_index().Debt
df
I need to get this DataFrame:
The start DataFrame:
the first column is the contract id;
the second column is the Date;
the third column is the coefficient of prepayment (EAR);
the 4th column is the contract payment;
The result DataFrame:
the 5th column is EAR. It equals Ear_coef(t) * Debt_with_EAR(t-1).
the 6th column is Debt_with_EAR. It equals Debt_with_EAR(t-1) + Contract(t) + EAR(t).
EAR and Debt_with_EAR at the first date equal 0 and Debt, respectively.
I have tried to solve this task with apply, but without success, since I need to use a previous value which is itself calculated.
This answer does not help me, since I have hundreds of ids: Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?
I will be grateful for the help.
You are looking for .shift().
It does not lend itself easily to .apply(), however. A work-around would be:
df['EAR'] = df['Ear_coef'] * df['Debt_with_EAR'].shift(1)
For your last column you might need .rolling(), but I am not sure about your formula; it seems never-ending, since each value depends on the previous result.
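Because Debt_with_EAR feeds back into itself, neither .shift() nor .rolling() can compute it in one pass. A minimal sketch, assuming exactly the recurrence stated in the question (first row per id: EAR = 0, Debt_with_EAR = Debt), iterates within each id group:
import pandas as pd

def add_ear(g):
    # EAR(t) = Ear_coef(t) * Debt_with_EAR(t-1)
    # Debt_with_EAR(t) = Debt_with_EAR(t-1) + Contract(t) + EAR(t)
    prev = float(g['Debt'].iloc[0])   # first row: Debt_with_EAR = Debt
    ear, debt = [0.0], [prev]
    for coef, contract in zip(g['Ear_coef'].iloc[1:], g['Contract'].iloc[1:]):
        e = coef * prev
        prev = prev + contract + e
        ear.append(e)
        debt.append(prev)
    return g.assign(EAR=ear, Debt_with_EAR=debt)

df = df.sort_values(['id', 'Date'])
df['Contract'] = df.groupby('id')['Debt'].diff().fillna(0)
result = df.groupby('id', group_keys=False).apply(add_ear)
On the sample data this gives, for id 111, EAR = [0, 20, 24] and Debt_with_EAR = [100, 120, 114].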

Pandas dataframe apply to datetime column: does not work

I have a dataframe with a datetime column. I want to apply a function that sets the value in this column to None if the date is earlier than another date. But the applied function sets all my values to None. Could you help me?
Here is my code:
import datetime
dateused = datetime.datetime.strptime('202004', '%Y%m')
df['date_pack'] = df['date_pack'].apply(lambda x: None if x < dateused else x)
The dtype of df['date_pack'] is datetime64[ns].
After this, all my values in my column 'date_pack' are None.
Thanks
I think you need Series.mask, which sets the masked values to NaT, the missing-value marker for datetimes:
import datetime
import pandas as pd

df = pd.DataFrame({'date_pack': pd.to_datetime(['2020-08-10', '2002-02-09'])})
dateused = datetime.datetime.strptime('202004', '%Y%m')
df['date_pack'] = df['date_pack'].mask(df['date_pack'] < dateused)
print(df)
date_pack
0 2020-08-10
1 NaT
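If you prefer explicit assignment over mask, an equivalent (same semantics, also vectorized) is boolean .loc indexing with pd.NaT:
df.loc[df['date_pack'] < dateused, 'date_pack'] = pd.NaT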

Filter each column by having the same value three times or more

I have a data set with dates as the index, where each column is the name of an item and the values are counts. I'm trying to figure out how to filter each column for stretches of more than 3 consecutive days where the count is zero. I was thinking of using a for loop; any help is appreciated. I'm using Python for this project.
I'm fairly new to Python. So far I tried using for loops, but did not get them to work in any way.
for i in a.index:
    if a.loc[i, 'name'] == 3 == df.loc[i+1, 'name'] == df.loc[i+2, 'name']:
        print(a.loc[i, "name"])
This fails with: Cannot add integral value to Timestamp without freq.
It would be better if you included a sample dataframe and desired output in your question; please do so next time. As it is, I have to guess what your data looks like and may not be answering your question. I assume the values are integers. Does your dataframe have a row for every day? I will assume that might not be the case, and will make it so that every day in the last delta days has a row. I created a sample dataframe like this:
import pandas as pd
import numpy as np
import datetime
# Here I am just creating random data from your description
delta = 365
start_date = datetime.datetime.now() - datetime.timedelta(days=delta)
end_date = datetime.datetime.now()
datetimes = [end_date - diff for diff in [datetime.timedelta(days=i) for i in range(delta,0,-1)]]
# This is the list of dates we will have in our final dataframe (includes all days)
dates = pd.Series([date.strftime('%Y-%m-%d') for date in datetimes], name='Date', dtype='datetime64[ns]')
# random integer dataframe
df = pd.DataFrame(np.random.randint(0, 5, size=(delta,4)), columns=['item' + str(i) for i in range(4)])
df = pd.concat([df, dates], axis=1).set_index('Date')
# Create a missing day (drop one arbitrary row so the reindex below has something to fill)
df = df.drop(df.index[100])
# Reindex so that index has all consecutive days
df = df.reindex(index=dates)
Now that we have a sample dataframe, the rest will be straightforward. I am going to check if a value in the dataframe is equal to 0 and then do a rolling sum with the window of 4 (>3). This way I can avoid for loops. The resulting dataframe has all the rows where at least one of the items had a value of 0 for 4 consecutive rows. If there is a 0 for more than window consecutive rows, it will show as two rows where the dates are just one day apart. I hope that makes sense.
# custom function as I want "np.nan" returned if a value does not equal "test_value"
def equals(df_value, test_value=0):
    return 1 if df_value == test_value else np.nan

# apply the function to every value in the dataframe
# for each row, calculate the sum of four subsequent rows (>3)
df = df.applymap(equals).rolling(window=4).sum()

# if there was np.nan in the sum, the sum is np.nan, so it can be dropped
# keep the rows where there is at least 1 value
df = df.dropna(thresh=1)

# drop all columns that don't have any values
df = df.dropna(thresh=1, axis=1)
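As an alternative sketch without the helper function, starting again from the original integer dataframe (called counts here to avoid clobbering df): comparing against 0 gives a boolean frame, the rolling sum counts consecutive zeros directly, and any() reduces to the columns that ever had such a streak:
is_zero = counts.eq(0)                           # True where the count is zero
streaks = is_zero.rolling(window=4).sum().eq(4)  # True where a run of 4 zeros ends
has_streak = streaks.any()                       # per column: any 4-day zero streak?
print(has_streak[has_streak].index.tolist())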

Row wise operations in pandas dataframe based on dates (sorting issue)

This question has two parts:
1) Is there a better way to do this?
2) If NO to #1, how can I fix my date issue?
I have a dataframe as follows
GROUP DATE VALUE DELTA
A 12/20/2015 2.5 ??
A 11/30/2015 25
A 1/31/2016 8.3
B etc etc
B etc etc
C etc etc
C etc etc
This is a representation, there are close to 100 rows for each group (each row representing a unique date).
For each letter in GROUP, I want to find the change in value between successive dates. So for example for GROUP A I want the change between 11/30/2015 and 12/20/2015, which is -22.5. Currently I am doing the following:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df.sort_values('DATE', ascending=True)
df_out = []
for GROUP in df.GROUP.unique():
    x = df[df.GROUP == GROUP]
    x['VALUESHIFT'] = x['VALUE'].shift(+1)
    x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
    df_out.append(x)
df_out = pd.concat(df_out)
The challenge I am running into is the dates are not sorted correctly. So when the shift takes place and I calculate the delta it is not really the delta between successive dates.
Is this the right approach to handle? If so how can I fix my date issue? I have reviewed/tried the following to no avail:
Applying datetime format in pandas for sorting
how to make a pandas dataframe column into a datetime object showing just the date to correctly sort
doing calculations in pandas dataframe based on trailing row
Pandas - Split dataframe into multiple dataframes based on dates?
Answering my own question. This works:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df_out = []
for ID in df.GROUP.unique():
    x = df[df.GROUP == ID]
    x.sort_values('DATE', ascending=True, inplace=True)
    x['VALUESHIFT'] = x['VALUE'].shift(+1)
    x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
    df_out.append(x)
df_out = pd.concat(df_out)
1) Added inplace=True to the sort.
2) Moved the sort inside the for loop.
3) Changed the loop variable from GROUP to ID, since GROUP is also the name of a column, which I imagine is considered sloppy.
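For what it's worth, the loop can be avoided entirely (answering part 1 of the question): sort once by group and date, then groupby(...).diff() computes the difference between successive dates within each group. A sketch using the column names above:
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.sort_values(['GROUP', 'DATE'])
df['DELTA'] = df.groupby('GROUP')['VALUE'].diff()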

Pandas: number of days elapsed since a certain date

I have a dataframe with a 'date' column with ~200 elements in the format yyyy-mm-dd.
I want to compute the number of days elapsed since 2001-11-25 for each of those elements and add a column of those numbers of elapsed days to the dataframe.
I know of the to_datetime() function but can't figure out how to make this happen.
Assuming your time values are in your index, you can just do this:
import pandas
x = pandas.date_range(start='2014-01-01', end='2014-01-06', freq='30T')
df = pandas.DataFrame(index=x, columns=['time since'])
basedate = pandas.Timestamp('2001-11-25')
df['time since'] = df.apply(lambda row: (row.name - basedate).days, axis=1)
If they're in a column, do:
df['time since'] = df['datetime_column'].apply(lambda x: (x - basedate).days)
In accordance with Jeff's comment, here's a correction to the second (and most relevant) part of the accepted answer:
df['time since'] = (df['datetime_column'] - basedate).dt.days
The subtraction yields a series of type Timedelta, which can then be represented as days.
In some cases you might need to pass the original column through pd.to_datetime(...) first.
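A minimal end-to-end sketch (the column name date is hypothetical):
import pandas as pd

df = pd.DataFrame({'date': ['2001-12-01', '2002-03-15', '2010-06-15']})
basedate = pd.Timestamp('2001-11-25')
df['days elapsed'] = (pd.to_datetime(df['date']) - basedate).dt.days
print(df)
#          date  days elapsed
# 0  2001-12-01             6
# 1  2002-03-15           110
# 2  2010-06-15          3124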
