Pandas: number of days elapsed since a certain date

I have a dataframe with a 'date' column with ~200 elements in the format yyyy-mm-dd.
I want to compute the number of days elapsed since 2001-11-25 for each of those elements and add a column of those numbers of elapsed days to the dataframe.
I know of the to_datetime() function but can't figure out how to make this happen.

Assuming your time values are in your index, you can just do this:
import pandas

# Build a sample index of timestamps every 30 minutes
x = pandas.date_range(start='2014-01-01', end='2014-01-06', freq='30T')
df = pandas.DataFrame(index=x, columns=['time since'])
basedate = pandas.Timestamp('2001-11-25')
# row.name is the row's index label, which is already a Timestamp
df['time since'] = df.apply(lambda row: (row.name - basedate).days, axis=1)
If they're in a column, do:
df['time since'] = df['datetime_column'].apply(lambda x: (x.name.to_datetime() - basedate).days)

In accordance with Jeff's comment, here's a correction to the second (and most relevant) part of the accepted answer:
df['time since'] = (df['datetime_column'] - basedate).dt.days
The subtraction yields a timedelta64 Series, whose .dt.days accessor gives the number of whole days.
In some cases you may need to pass the original column through pd.to_datetime(...) first.
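Putting the pieces together, here is a minimal end-to-end sketch (the 'date' column name and base date are taken from the question; the sample values are made up):
import pandas as pd
df = pd.DataFrame({'date': ['2001-12-01', '2002-01-15']})
basedate = pd.Timestamp('2001-11-25')
df['days elapsed'] = (pd.to_datetime(df['date']) - basedate).dt.days
print(df)
#          date  days elapsed
# 0  2001-12-01             6
# 1  2002-01-15            51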

Related

Is there a Python function to convert days in decimal into HH:MM:SS?

I am trying to style a dataframe and export it to another Excel sheet. I converted the reference time and time columns in the dataframe to timedelta. After the conversion and applying the style, the corresponding column values changed into days expressed as decimals.
How do I change these decimal values to HH:MM:SS? Do I have to convert them with the usual formulas (like multiplying by 24 first to get the hours), or is there a way to retain the original values? For example, 0.255636574074074 should come out as 6:08:07.
What I have done is -
df['Total Time'] = pd.to_timedelta(df['Total Time'].astype(str))
df_styled = df.style.applymap(lambda x: 'background-color: red' if x > threshold else 'background-color: white', subset=['Total Time'])
The original column values are datetime values which were changed to timedelta.
Use the parameter unit='d' to convert the values to timedeltas:
a = pd.to_timedelta(0.255636574074074, unit='d')
print (a)
0 days 06:08:06.999993600
So in your solution:
df['col'] = pd.to_timedelta(df['col'], unit='d')
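The floating-point tail (…06.999993600) can be tidied by rounding to whole seconds, which recovers the expected 6:08:07. A small sketch:
a = pd.to_timedelta(0.255636574074074, unit='d').round('s')
print(a)
# 0 days 06:08:07
# or for the whole column:
# df['col'] = pd.to_timedelta(df['col'], unit='d').dt.round('s')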

Filter each column by having the same value three times or more

I have a data set with dates as the index, where each column is the name of an item and the values are counts. I'm trying to filter each column to find where there are more than 3 consecutive days with a count of zero. I was thinking of using a for loop; any help is appreciated. I'm using python for this project.
I'm fairly new to python; so far I tried using for loops, but did not get them to work in any way:
for i in a.index:
    if a.loc[i, 'name'] == 3 == df.loc[i+1, 'name'] == df.loc[i+2, 'name']:
        print(a.loc[i, 'name'])
This fails with: Cannot add integral value to Timestamp without freq.
It would be better if you included a sample dataframe and desired output in your question; please do so next time. As it is, I have to guess what your data looks like, so I may not be answering your exact question. I assume the values are integers. Your dataframe may not have a row for every day, so I reindex to make sure every day in the last delta days has a row. I created a sample dataframe like this:
import pandas as pd
import numpy as np
import datetime
# Here I am just creating random data from your description
delta = 365
start_date = datetime.datetime.now() - datetime.timedelta(days=delta)
end_date = datetime.datetime.now()
datetimes = [end_date - diff for diff in [datetime.timedelta(days=i) for i in range(delta,0,-1)]]
# This is the list of dates we will have in our final dataframe (includes all days)
dates = pd.Series([date.strftime('%Y-%m-%d') for date in datetimes], name='Date', dtype='datetime64[ns]')
# random integer dataframe
df = pd.DataFrame(np.random.randint(0, 5, size=(delta,4)), columns=['item' + str(i) for i in range(4)])
df = pd.concat([df, dates], axis=1).set_index('Date')
# Drop a row to simulate a missing day (this assumes the date falls in the generated range)
df = df.drop(df.loc['2019-08-01'].name)
# Reindex so that index has all consecutive days
df = df.reindex(index=dates)
Now that we have a sample dataframe, the rest is straightforward. I check whether each value in the dataframe equals 0 and then take a rolling sum with a window of 4 (>3), which avoids for loops. The resulting dataframe has all the rows where at least one of the items had a value of 0 for 4 consecutive rows. If a column has 0 for more than 4 consecutive rows, it shows up as several hits on dates just one day apart. I hope that makes sense.
# custom function: return 1 if a value equals "test_value", else np.nan
def equals(df_value, test_value=0):
    return 1 if df_value == test_value else np.nan
# apply the function to every value in the dataframe, then
# sum a rolling window of 4 rows (each row plus the 3 before it, i.e. >3)
df = df.applymap(equals).rolling(window=4).sum()
# if there was np.nan anywhere in the window, the sum is np.nan
# keep the rows where there is at least 1 non-NaN value
df = df.dropna(thresh=1)
# drop all columns that don't have any values left
df = df.dropna(thresh=1, axis=1)
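As a quick sanity check, here is the same idea (reusing the equals function above) on a tiny hand-made frame; the item names are made up:
demo = pd.DataFrame({'itemA': [1, 0, 0, 0, 0, 2],
                     'itemB': [1, 2, 3, 0, 0, 1]})
flagged = demo.applymap(equals).rolling(window=4).sum()
# only itemA ever sums to 4, i.e. has four consecutive zeros
print(flagged.dropna(thresh=1).dropna(thresh=1, axis=1))
#    itemA
# 4    4.0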

Sort dataframe by columns names if the columns are dates, pandas?

My df column names are dates in the format dd-mm-yy. When I use sort_index(axis=1) it sorts by the first two digits (the days), so the result doesn't make sense chronologically. How can I sort the columns so the months are taken into account as well?
my df headers:
submitted_at 06-05-18 13-05-18 29-04-18
I expected the output of:
submitted_at 29-04-18 06-05-18 13-05-18
Convert the columns to datetime and use argsort to find the correct ordering. This will put all non-dates to the left in the order they occur, followed by the sorted dates.
import pandas as pd
df = pd.DataFrame(columns=['submitted_at', '06-05-18', '13-05-18', '29-04-18'])
idx = pd.to_datetime(df.columns, errors='coerce', format='%d-%m-%y').argsort()
df.iloc[:, idx]
Empty DataFrame
Columns: [submitted_at, 29-04-18, 06-05-18, 13-05-18]
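Note that iloc returns a new, reordered DataFrame, so to keep the new column order assign it back:
df = df.iloc[:, idx]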
Another option: convert the strings to datetime and sort the column labels by their parsed dates (this assumes every column label is a date; note the format needs %y for a two-digit year):
from datetime import datetime
df = df[sorted(df.columns, key=lambda c: datetime.strptime(c, '%d-%m-%y'))]
Just convert your column to datetime:
df['newdate'] = pd.to_datetime(df.date, format='%d-%m-%y')
and then sort it using sort_values:
df.sort_values(by='newdate')
(Note that this sorts rows by a date column rather than reordering date-named columns.)

find nearest future date from columns

Please help with finding the next date from today for each row item from the four columns, as shown below. I have been stuck on this for a while now.
InDate1 InDate2 InDate3 InDate4
284075 2018-03-07 2018-09-07 2019-03-07 2019-01-21
334627 2018-03-07 2018-09-07 2019-03-07 2019-09-07
Using lookup:
For each row, find the column that holds the closest future date:
import pandas as pd
s = (df.apply(pd.to_datetime) # If not already datetime
.apply(lambda x: (x - pd.to_datetime('today')).dt.total_seconds())
.where(lambda x: x.gt(0)).idxmin(1))
print(s)
#284075 InDate3
#334627 InDate3
#dtype: object
Then lookup the values for each row:
df.lookup(s.index, s)
#array(['2019-03-07', '2019-03-07'], dtype=object)
To elaborate on what this does, you can look at each part separately.
First, determine the time difference between your DataFrame and today. .apply(pd.to_datetime) converts everything to datetime so arithmetic with the dates works, and the second apply finds the time difference, converting it from a timedelta to a number of seconds, which is just a float. (Not sure why a simple df - pd.to_datetime('today') doesn't quite work and the apply is needed.)
s = (df.apply(pd.to_datetime) # If not already datetime
.apply(lambda x: (x - pd.to_datetime('today')).dt.total_seconds()))
print(s)
InDate1 InDate2 InDate3 InDate4
284075 -2.769565e+07 -1.179805e+07 3.840347e+06 -4.765262e+04
334627 -2.769565e+07 -1.179805e+07 3.840347e+06 1.973795e+07
Dates in the future will have a positive time difference, so I use .where to find only the cells that have positive values, replacing everything else with NaN
s = s.where(lambda x: x.gt(0))
# Could use s.where(s.gt(0)) here since `s` is defined
print(s)
InDate1 InDate2 InDate3 InDate4
284075 NaN NaN 3.840347e+06 NaN
334627 NaN NaN 3.840347e+06 1.973795e+07
Then .idxmin(axis=1) will return the column that has the minimum value (ignoring NaN), for each row (axis=1), which is the closest future date.
s = s.idxmin(1)
print(s)
284075 InDate3
334627 InDate3
dtype: object
Finally, DataFrame.lookup, which looks up the original date in each of those cells, is fairly self-explanatory.
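One caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, an equivalent with plain NumPy indexing looks roughly like this:
import numpy as np
df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(s)]
#array(['2019-03-07', '2019-03-07'], dtype=object)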
Please check this.
First, stack the date values into rows so that we can compare against today and take the minimum.
df1 = df.stack().reset_index()
df1.columns = ["ID", "Field", "Date"]
Then filter to dates later than today and find the minimum date per row.
df1 = df1[df1.Date > datetime.datetime.now()].groupby("ID").agg("min").reset_index()
Then pivot the result; just before pivoting, assign a static value to Field so we get a single column header instead of InDate1, etc.
df1.Field = "MinValue"
df1 = df1.pivot(index="ID", columns="Field", values="Date").reset_index()
Finally, merge the minimum-date dataframe back into the original dataframe.
df = df.merge(df1, how="left")
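Pieced together with its imports, and with the pivot step simplified to a plain groupby, the idea looks roughly like this (a sketch; it assumes df holds the four InDate columns shown above, and the 'MinValue' label is kept from the answer):
import datetime
import pandas as pd
long_df = df.stack().reset_index()
long_df.columns = ["ID", "Field", "Date"]
long_df["Date"] = pd.to_datetime(long_df["Date"])  # make sure these are real datetimes
future = long_df[long_df["Date"] > datetime.datetime.now()]
nearest = future.groupby("ID")["Date"].min().rename("MinValue")
df = df.merge(nearest, how="left", left_index=True, right_index=True)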

Extracting the hour from a time column in pandas

Suppose I have the following dataset:
How would I create a new column containing just the hour of that time?
For example, the code below works for individual times, but I haven't been able to generalise it for a column in pandas.
from datetime import datetime
t = datetime.strptime('9:33:07', '%H:%M:%S')
print(t.hour)
Use to_datetime to convert to datetimes, then take dt.hour:
df = pd.DataFrame({'TIME':['9:33:07','9:41:09']})
# should be slower without an explicit format:
# df['hour'] = pd.to_datetime(df['TIME']).dt.hour
df['hour'] = pd.to_datetime(df['TIME'], format='%H:%M:%S').dt.hour
print (df)
TIME hour
0 9:33:07 9
1 9:41:09 9
If you want to keep working with datetimes in the TIME column, you can assign the converted values back:
df['TIME'] = pd.to_datetime(df['TIME'], format='%H:%M:%S')
df['hour'] = df['TIME'].dt.hour
print (df)
TIME hour
0 1900-01-01 09:33:07 9
1 1900-01-01 09:41:09 9
My suggestion:
df = pd.DataFrame({'TIME':['9:33:07','9:41:09']})
df['hour'] = df.TIME.str.extract(r"(^\d+):", expand=False)
str.extract(...) is a vectorized method that applies a regular expression pattern (here (^\d+):, which captures the hour part of TIME) and, with expand=False, returns a pandas Series.
The result is stored in the 'hour' column.
You can also use extract() twice to pull out the 'hour' column:
df['hour'] = df.TIME.str.extract(r"(\d+:)")
df['hour'] = df.hour.str.extract(r"(\d+)")
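Note that str.extract returns strings; if you need a numeric hour from this approach, one possible cast (a sketch) is:
df['hour'] = df.TIME.str.extract(r"^(\d+):", expand=False).astype(int)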
