Related
I am creating a DataFrame from a csv as follows:
stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)
The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?
There are two possible solutions:
Use a boolean mask, then use df.loc[mask]
Set the date column as a DatetimeIndex, then use df[start_date : end_date]
Using a boolean mask:
Ensure df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetimes,
np.datetime64s, pd.Timestamps, or even datetime strings:
#greater than the start date and smaller than the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
or re-assign to df
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
Using a DatetimeIndex:
If you are going to do a lot of selections by date, it may be quicker to set the
date column as the index first. Then you can select rows by date using
df.loc[start_date:end_date].
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
While Python list indexing, e.g. seq[start:end] includes start but not end, in contrast, Pandas df.loc[start_date : end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index however.
Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
I feel the best option will be to use the direct checks rather than using loc function:
df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]
It works for me.
Major issue with loc function with a slice is that the limits should be present in the actual values, if not this will result in KeyError.
You can also use between:
df[df.some_date.between(start_date, end_date)]
You can use the isin method on the date column like so
df[df["date"].isin(pd.date_range(start_date, end_date))]
Note: This only works with dates (as the question asks) and not timestamps.
Example:
import numpy as np
import pandas as pd
# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]
print(in_range_df) # print result
which gives
0 1 2 date
14 0.960974 0.144271 0.839593 2017-01-15
15 0.814376 0.723757 0.047840 2017-01-16
16 0.911854 0.123130 0.120995 2017-01-17
17 0.505804 0.416935 0.928514 2017-01-18
18 0.204869 0.708258 0.170792 2017-01-19
19 0.014389 0.214510 0.045201 2017-01-20
Keeping the solution simple and pythonic, I would suggest you to try this.
In case if you are going to do this frequently the best solution would be to first set the date column as index which will convert the column in DateTimeIndex and use the following condition to slice any range of dates.
import pandas as pd
data_frame = data_frame.set_index('date')
df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
pandas 0.22 has a between() function.
Makes answering this question easier and more readable code.
# create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})
Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:
# use the between statement to get a boolean mask
df['dates'].between('2018-11-27','2019-01-15', inclusive=False)
0 False
1 False
2 False
3 False
4 False
# you can pass this boolean mask straight to loc
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]
dates
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
335 2018-12-02
Notice the inclusive argument. very helpful when you want to be explicit about your range. notice when set to True we return Nov 27th of 2018 as well:
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
dates
330 2018-11-27
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
This method is also faster than the previously mentioned isin method:
%%timeit -n 5
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
%%timeit -n 5
df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
However, it is not faster than the currently accepted answer, provided by unutbu, only if the mask is already created. but if the mask is dynamic and needs to be reassigned over and over, my method may be more efficient:
# already create the mask THEN time the function
start_date = dt.datetime(2018,11,27)
end_date = dt.datetime(2019,1,15)
mask = (df['dates'] > start_date) & (df['dates'] <= end_date)
%%timeit -n 5
df.loc[mask]
191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
Another option, how to achieve this, is by using pandas.DataFrame.query() method. Let me show you an example on the following data frame called df.
>>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
>>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
>>> print(df)
col_1 date
0 0.015198 2020-01-01
1 0.638600 2020-01-02
2 0.348485 2020-01-03
3 0.247583 2020-01-04
4 0.581835 2020-01-05
As an argument, use the condition for filtering like this:
>>> start_date, end_date = '2020-01-02', '2020-01-04'
>>> print(df.query('date >= #start_date and date <= #end_date'))
col_1 date
1 0.244104 2020-01-02
2 0.374775 2020-01-03
3 0.510053 2020-01-04
If you do not want to include boundaries, just change the condition like following:
>>> print(df.query('date > #start_date and date < #end_date'))
col_1 date
2 0.374775 2020-01-03
You can use the method truncate:
dates = pd.date_range('2016-01-01', '2016-01-06', freq='d')
df = pd.DataFrame(index=dates, data={'A': 1})
A
2016-01-01 1
2016-01-02 1
2016-01-03 1
2016-01-04 1
2016-01-05 1
2016-01-06 1
Select data between two dates:
df.truncate(before=pd.Timestamp('2016-01-02'),
after=pd.Timestamp('2016-01-4'))
Output:
A
2016-01-02 1
2016-01-03 1
2016-01-04 1
It is highly recommended to convert a date column to an index. Doing that will give a lot of facilities. One is to select the rows between two dates easily, you can see this example:
import numpy as np
import pandas as pd
# Dataframe with monthly data between 2016 - 2020
df = pd.DataFrame(np.random.random((60, 3)))
df['date'] = pd.date_range('2016-1-1', periods=60, freq='M')
To select the rows between 2017-01-01 and 2019-01-01, you need only to convert the date column to an index:
df.set_index('date', inplace=True)
and then only slicing:
df.loc['2017':'2019']
You can select the date column as index while reading the csv file directly instead of the df.set_index():
df = pd.read_csv('file_name.csv',index_col='date')
I prefer not to alter the df.
An option is to retrieve the index of the start and end dates:
import numpy as np
import pandas as pd
#Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
#Get the index of the start and end dates respectively
start = df[df['date']=='2017-01-07'].index[0]
end = df[df['date']=='2017-01-14'].index[0]
#Show the sliced df (from 2017-01-07 to 2017-01-14)
df.loc[start:end]
which results in:
0 1 2 date
6 0.5 0.8 0.8 2017-01-07
7 0.0 0.7 0.3 2017-01-08
8 0.8 0.9 0.0 2017-01-09
9 0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14
Inspired by unutbu
print(df.dtypes) #Make sure the format is 'object'. Rerunning this after index will not show values.
columnName = 'YourColumnName'
df[columnName+'index'] = df[columnName] #Create a new column for index
df.set_index(columnName+'index', inplace=True) #To build index on the timestamp/dates
df.loc['2020-09-03 01:00':'2020-09-06'] #Select range from the index. This is your new Dataframe.
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000],
'Duration':['30days','50days','55days','40days','60days','35days','55days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400],
'InsertedDates':["2021-11-14","2021-11-15","2021-11-16","2021-11-17","2021-11-18","2021-11-19","2021-11-20"]
})
df = pd.DataFrame(technologies)
print(df)
Using pandas.DataFrame.loc to Filter Rows by Dates
Method 1:
mask = (df['InsertedDates'] > start_date) & (df['InsertedDates'] <= end_date)
df2 = df.loc[mask]
print(df2)
Method 2:
start_date = '2021-11-15'
end_date = '2021-11-19'
after_start_date = df["InsertedDates"] >= start_date
before_end_date = df["InsertedDates"] <= end_date
between_two_dates = after_start_date & before_end_date
df2 = df.loc[between_two_dates]
print(df2)
Using pandas.DataFrame.query() to select DataFrame Rows
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates >= #start_date and InsertedDates <= #end_date')
print(df2)
Select rows between two dates using DataFrame.query()
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates > #start_date and InsertedDates < #end_date')
print(df2)
pandas.Series.between() function Using two dates
df2 = df.loc[df["InsertedDates"].between("2021-11-16", "2021-11-18")]
print(df2)
Select DataFrame rows between two dates using DataFrame.isin()
df2 = df[df["InsertedDates"].isin(pd.date_range("2021-11-15", "2021-11-17"))]
print(df2)
you can do it with pd.date_range() and Timestamp.
Let's say you have read a csv file with a date column using parse_dates option:
df = pd.read_csv('my_file.csv', parse_dates=['my_date_col'])
Then you can define a date range index :
rge = pd.date_range(end='15/6/2020', periods=2)
and then filter your values by date thanks to a map:
df.loc[df['my_date_col'].map(lambda row: row.date() in rge)]
I am creating a DataFrame from a csv as follows:
stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)
The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?
There are two possible solutions:
Use a boolean mask, then use df.loc[mask]
Set the date column as a DatetimeIndex, then use df[start_date : end_date]
Using a boolean mask:
Ensure df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetimes,
np.datetime64s, pd.Timestamps, or even datetime strings:
#greater than the start date and smaller than the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
or re-assign to df
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
Using a DatetimeIndex:
If you are going to do a lot of selections by date, it may be quicker to set the
date column as the index first. Then you can select rows by date using
df.loc[start_date:end_date].
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
While Python list indexing, e.g. seq[start:end] includes start but not end, in contrast, Pandas df.loc[start_date : end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index however.
Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
I feel the best option will be to use the direct checks rather than using loc function:
df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]
It works for me.
Major issue with loc function with a slice is that the limits should be present in the actual values, if not this will result in KeyError.
You can also use between:
df[df.some_date.between(start_date, end_date)]
You can use the isin method on the date column like so
df[df["date"].isin(pd.date_range(start_date, end_date))]
Note: This only works with dates (as the question asks) and not timestamps.
Example:
import numpy as np
import pandas as pd
# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]
print(in_range_df) # print result
which gives
0 1 2 date
14 0.960974 0.144271 0.839593 2017-01-15
15 0.814376 0.723757 0.047840 2017-01-16
16 0.911854 0.123130 0.120995 2017-01-17
17 0.505804 0.416935 0.928514 2017-01-18
18 0.204869 0.708258 0.170792 2017-01-19
19 0.014389 0.214510 0.045201 2017-01-20
Keeping the solution simple and pythonic, I would suggest you to try this.
In case if you are going to do this frequently the best solution would be to first set the date column as index which will convert the column in DateTimeIndex and use the following condition to slice any range of dates.
import pandas as pd
data_frame = data_frame.set_index('date')
df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
pandas 0.22 has a between() function.
Makes answering this question easier and more readable code.
# create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})
Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:
# use the between statement to get a boolean mask
df['dates'].between('2018-11-27','2019-01-15', inclusive=False)
0 False
1 False
2 False
3 False
4 False
# you can pass this boolean mask straight to loc
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]
dates
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
335 2018-12-02
Notice the inclusive argument. very helpful when you want to be explicit about your range. notice when set to True we return Nov 27th of 2018 as well:
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
dates
330 2018-11-27
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
This method is also faster than the previously mentioned isin method:
%%timeit -n 5
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
%%timeit -n 5
df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
However, it is not faster than the currently accepted answer, provided by unutbu, only if the mask is already created. but if the mask is dynamic and needs to be reassigned over and over, my method may be more efficient:
# already create the mask THEN time the function
start_date = dt.datetime(2018,11,27)
end_date = dt.datetime(2019,1,15)
mask = (df['dates'] > start_date) & (df['dates'] <= end_date)
%%timeit -n 5
df.loc[mask]
191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
Another option, how to achieve this, is by using pandas.DataFrame.query() method. Let me show you an example on the following data frame called df.
>>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
>>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
>>> print(df)
col_1 date
0 0.015198 2020-01-01
1 0.638600 2020-01-02
2 0.348485 2020-01-03
3 0.247583 2020-01-04
4 0.581835 2020-01-05
As an argument, use the condition for filtering like this:
>>> start_date, end_date = '2020-01-02', '2020-01-04'
>>> print(df.query('date >= #start_date and date <= #end_date'))
col_1 date
1 0.244104 2020-01-02
2 0.374775 2020-01-03
3 0.510053 2020-01-04
If you do not want to include boundaries, just change the condition like following:
>>> print(df.query('date > #start_date and date < #end_date'))
col_1 date
2 0.374775 2020-01-03
You can use the method truncate:
dates = pd.date_range('2016-01-01', '2016-01-06', freq='d')
df = pd.DataFrame(index=dates, data={'A': 1})
A
2016-01-01 1
2016-01-02 1
2016-01-03 1
2016-01-04 1
2016-01-05 1
2016-01-06 1
Select data between two dates:
df.truncate(before=pd.Timestamp('2016-01-02'),
after=pd.Timestamp('2016-01-4'))
Output:
A
2016-01-02 1
2016-01-03 1
2016-01-04 1
It is highly recommended to convert a date column to an index. Doing that will give a lot of facilities. One is to select the rows between two dates easily, you can see this example:
import numpy as np
import pandas as pd
# Dataframe with monthly data between 2016 - 2020
df = pd.DataFrame(np.random.random((60, 3)))
df['date'] = pd.date_range('2016-1-1', periods=60, freq='M')
To select the rows between 2017-01-01 and 2019-01-01, you need only to convert the date column to an index:
df.set_index('date', inplace=True)
and then only slicing:
df.loc['2017':'2019']
You can select the date column as index while reading the csv file directly instead of the df.set_index():
df = pd.read_csv('file_name.csv',index_col='date')
I prefer not to alter the df.
An option is to retrieve the index of the start and end dates:
import numpy as np
import pandas as pd
#Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
#Get the index of the start and end dates respectively
start = df[df['date']=='2017-01-07'].index[0]
end = df[df['date']=='2017-01-14'].index[0]
#Show the sliced df (from 2017-01-07 to 2017-01-14)
df.loc[start:end]
which results in:
0 1 2 date
6 0.5 0.8 0.8 2017-01-07
7 0.0 0.7 0.3 2017-01-08
8 0.8 0.9 0.0 2017-01-09
9 0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14
Inspired by unutbu
print(df.dtypes) #Make sure the format is 'object'. Rerunning this after index will not show values.
columnName = 'YourColumnName'
df[columnName+'index'] = df[columnName] #Create a new column for index
df.set_index(columnName+'index', inplace=True) #To build index on the timestamp/dates
df.loc['2020-09-03 01:00':'2020-09-06'] #Select range from the index. This is your new Dataframe.
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000],
'Duration':['30days','50days','55days','40days','60days','35days','55days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400],
'InsertedDates':["2021-11-14","2021-11-15","2021-11-16","2021-11-17","2021-11-18","2021-11-19","2021-11-20"]
})
df = pd.DataFrame(technologies)
print(df)
Using pandas.DataFrame.loc to Filter Rows by Dates
Method 1:
mask = (df['InsertedDates'] > start_date) & (df['InsertedDates'] <= end_date)
df2 = df.loc[mask]
print(df2)
Method 2:
start_date = '2021-11-15'
end_date = '2021-11-19'
after_start_date = df["InsertedDates"] >= start_date
before_end_date = df["InsertedDates"] <= end_date
between_two_dates = after_start_date & before_end_date
df2 = df.loc[between_two_dates]
print(df2)
Using pandas.DataFrame.query() to select DataFrame Rows
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates >= #start_date and InsertedDates <= #end_date')
print(df2)
Select rows between two dates using DataFrame.query()
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates > #start_date and InsertedDates < #end_date')
print(df2)
pandas.Series.between() function Using two dates
df2 = df.loc[df["InsertedDates"].between("2021-11-16", "2021-11-18")]
print(df2)
Select DataFrame rows between two dates using DataFrame.isin()
df2 = df[df["InsertedDates"].isin(pd.date_range("2021-11-15", "2021-11-17"))]
print(df2)
you can do it with pd.date_range() and Timestamp.
Let's say you have read a csv file with a date column using parse_dates option:
df = pd.read_csv('my_file.csv', parse_dates=['my_date_col'])
Then you can define a date range index :
rge = pd.date_range(end='15/6/2020', periods=2)
and then filter your values by date thanks to a map:
df.loc[df['my_date_col'].map(lambda row: row.date() in rge)]
This question already has answers here:
Calculate Time Difference Between Two Pandas Columns in Hours and Minutes
(4 answers)
Closed 2 years ago.
I am trying to find the time difference between two columns of the following frame:
Test Date | Test Type | First Use Date
I used the following function definition to get the difference:
def days_between(d1, d2):
d1 = datetime.strptime(d1, "%Y-%m-%d")
d2 = datetime.strptime(d2, "%Y-%m-%d")
return abs((d2 - d1).days)
And it works fine, however it does not take a series as an input. So I had to construct a for loop that loops over indices:
age_veh = []
for i in range(0, len(data_manufacturer)-1):
age_veh[i].append(days_between(data_manufacturer.iloc[i,0], data_manufacturer.iloc[i,4]))
However, it does return an error:
IndexError: list index out of range
I don't know whether it's the right way of doing and what am I doing wrong or an alternative solution will be much appreciated. Please also bear in mind that I have around 2 mil rows.
Convert the columns using to_datetime then you can subtract the columns to produce a timedelta on the abs values, then you can call dt.days to get the total number of days, example:
In [119]:
import io
import pandas as pd
t="""Test Date,Test Type,First Use Date
2011-02-05,A,2010-01-05
2012-02-05,A,2010-03-05
2013-02-05,A,2010-06-05
2014-02-05,A,2010-08-05"""
df = pd.read_csv(io.StringIO(t))
df
Out[119]:
Test Date Test Type First Use Date
0 2011-02-05 A 2010-01-05
1 2012-02-05 A 2010-03-05
2 2013-02-05 A 2010-06-05
3 2014-02-05 A 2010-08-05
In [121]:
df['Test Date'] = pd.to_datetime(df['Test Date'])
df['First Use Date'] = pd.to_datetime(df['First Use Date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
Test Date 4 non-null datetime64[ns]
Test Type 4 non-null object
First Use Date 4 non-null datetime64[ns]
dtypes: datetime64[ns](2), object(1)
memory usage: 128.0+ bytes
In [122]:
df['days'] = (df['Test Date'] - df['First Use Date']).abs().dt.days
df
Out[122]:
Test Date Test Type First Use Date days
0 2011-02-05 A 2010-01-05 396
1 2012-02-05 A 2010-03-05 702
2 2013-02-05 A 2010-06-05 976
3 2014-02-05 A 2010-08-05 1280
IIUC you can first convert columns to_datetime, use abs and then convert timedelta to days:
print df
id value date1 date2 sum
0 A 150 2014-04-08 2014-03-08 NaN
1 B 100 2014-05-08 2014-02-08 NaN
2 B 200 2014-01-08 2014-07-08 100
3 A 200 2014-04-08 2014-03-08 NaN
4 A 300 2014-06-08 2014-04-08 350
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
df['diff'] = (df['date1'] - df['date2']).abs() / np.timedelta64(1, 'D')
print df
id value date1 date2 sum diff
0 A 150 2014-04-08 2014-03-08 NaN 31
1 B 100 2014-05-08 2014-02-08 NaN 89
2 B 200 2014-01-08 2014-07-08 100 181
3 A 200 2014-04-08 2014-03-08 NaN 31
4 A 300 2014-06-08 2014-04-08 350 61
EDIT:
I think better is use for converting np.timedelta64(1, 'D') to days in larger DataFrames, because it is faster:
I use EdChum sample, only len(df) = 4k:
import io
import pandas as pd
import numpy as np
t=u"""Test Date,Test Type,First Use Date
2011-02-05,A,2010-01-05
2012-02-05,A,2010-03-05
2013-02-05,A,2010-06-05
2014-02-05,A,2010-08-05"""
df = pd.read_csv(io.StringIO(t))
df = pd.concat([df]*1000).reset_index(drop=True)
df['Test Date'] = pd.to_datetime(df['Test Date'])
df['First Use Date'] = pd.to_datetime(df['First Use Date'])
print (df['Test Date'] - df['First Use Date']).abs().dt.days
print (df['Test Date'] - df['First Use Date']).abs() / np.timedelta64(1, 'D')
Timings:
In [174]: %timeit (df['Test Date'] - df['First Use Date']).abs().dt.days
10 loops, best of 3: 38.8 ms per loop
In [175]: %timeit (df['Test Date'] - df['First Use Date']).abs() / np.timedelta64(1, 'D')
1000 loops, best of 3: 1.62 ms per loop
I have a series with some datetimes (as strings) and some nulls as 'nan':
import pandas as pd, numpy as np, datetime as dt
df = pd.DataFrame({'Date':['2014-10-20 10:44:31', '2014-10-23 09:33:46', 'nan', '2014-10-01 09:38:45']})
I'm trying to convert these to datetime:
df['Date'] = df['Date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
but I get the error:
time data 'nan' does not match format '%Y-%m-%d %H:%M:%S'
So I try to turn these into actual nulls:
df.ix[df['Date'] == 'nan', 'Date'] = np.NaN
and repeat:
df['Date'] = df['Date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
but then I get the error:
must be string, not float
What is the quickest way to solve this problem?
Just use to_datetime and set errors='coerce' to handle duff data:
In [321]:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df
Out[321]:
Date
0 2014-10-20 10:44:31
1 2014-10-23 09:33:46
2 NaT
3 2014-10-01 09:38:45
In [322]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
Date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 64.0 bytes
the problem with calling strptime is that it will raise an error if the string, or dtype is incorrect.
If you did this then it would work:
In [324]:
def func(x):
try:
return dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
except:
return pd.NaT
df['Date'].apply(func)
Out[324]:
0 2014-10-20 10:44:31
1 2014-10-23 09:33:46
2 NaT
3 2014-10-01 09:38:45
Name: Date, dtype: datetime64[ns]
but it will be faster to use the inbuilt to_datetime rather than call apply which essentially just loops over your series.
timings
In [326]:
%timeit pd.to_datetime(df['Date'], errors='coerce')
%timeit df['Date'].apply(func)
10000 loops, best of 3: 65.8 µs per loop
10000 loops, best of 3: 186 µs per loop
We see here that using to_datetime is 3X faster.
I find letting pandas do the work to be too slow on large dataframes. In another post I learned of a technique that speeds this up dramatically when the number of unique values is much smaller than the number of rows. (My data is usually stock price or trade blotter data.) It first builds a dict that maps the text dates to their datetime objects, then applies the dict to convert the column of text dates.
def str2time(val):
try:
return dt.datetime.strptime(val, '%H:%M:%S.%f')
except:
return pd.NaT
def TextTime2Time(s):
times = {t : str2time(t) for t in s.unique()}
return s.apply(lambda v: times[v])
df.date = TextTime2Time(df.date)
I have an irregularly indexed time series of data with seconds resolution like:
import pandas as pd
idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43',
'2012-09-26 18:35:11', '2012-11-11 2:34:59']
status = [1, 0, 1, 0]
df = pd.DataFrame(status, index=idx, columns = ['status'])
df = df.reindex(pd.to_datetime(df.index))
In [62]: df
Out[62]:
status
2012-01-01 12:43:35 1
2012-03-12 15:46:43 0
2012-09-26 18:35:11 1
2012-11-11 02:34:59 0
and I am interested in the fraction of the year when the status is 1. The way I currently do it is that I reindex df with every second in the year and use forward filling like:
full_idx = pd.date_range(start = '1/1/2012', end = '12/31/2012', freq='s')
df1 = df.reindex(full_idx, method='ffill')
which returns a DataFrame that contains every second for the year which I can then calculate the mean for, to see the percentage of time in the 1 status like:
In [66]: df1
Out[66]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 31536001 entries, 2012-01-01 00:00:00 to 2012-12-31 00:00:00
Freq: S
Data columns:
status 31490186 non-null values
dtypes: float64(1)
In [67]: df1.status.mean()
Out[67]: 0.31953371123308066
The problem is that I have to do this for a lot of data, and reindexing it for every second in the year is most expensive operation by far.
What are better ways to do this?
There doesn't seem to be a pandas method to compute time differences between entries of an irregular time series, though there is a convenience method to convert a time series index to an array of datetime.datetime objects, which can be converted to datetime.timedelta objects through subtraction.
In [6]: start_end = pd.DataFrame({'status': [0, 0]},
index=[pd.datetools.parse('1/1/2012'),
pd.datetools.parse('12/31/2012')])
In [7]: df = df.append(start_end).sort()
In [8]: df
Out[8]:
status
2012-01-01 00:00:00 0
2012-01-01 12:43:35 1
2012-03-12 15:46:43 0
2012-09-26 18:35:11 1
2012-11-11 02:34:59 0
2012-12-31 00:00:00 0
In [9]: pydatetime = pd.Series(df.index.to_pydatetime(), index=df.index)
In [11]: df['duration'] = pydatetime.diff().shift(-1).\
map(datetime.timedelta.total_seconds, na_action='ignore')
In [16]: df
Out[16]:
status duration
2012-01-01 00:00:00 0 45815
2012-01-01 12:43:35 1 6145388
2012-03-12 15:46:43 0 17117308
2012-09-26 18:35:11 1 3916788
2012-11-11 02:34:59 0 4310701
2012-12-31 00:00:00 0 NaN
In [17]: (df.status * df.duration).sum() / df.duration.sum()
Out[17]: 0.31906950786402843
Note:
Our answers seem to differ because I set status before the first timestamp to zero, while those entries are NA in your df1 as there's no start value to forward fill and NA values are excluded by pandas mean().
timedelta.total_seconds() is new in Python 2.7.
Timing comparison of this method versus reindexing:
In [8]: timeit delta_method(df)
1000 loops, best of 3: 1.3 ms per loop
In [9]: timeit redindexing(df)
1 loops, best of 3: 2.78 s per loop
Another potential approach is to use traces.
import traces
from dateutil.parser import parse as date_parse
idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43',
'2012-09-26 18:35:11', '2012-11-11 2:34:59']
status = [1, 0, 1, 0]
# create a TimeSeries from date strings and status
ts = traces.TimeSeries(default=0)
for date_string, status_value in zip(idx, status):
ts[date_parse(date_string)] = status_value
# compute distribution
ts.distribution(
start=date_parse('2012-01-01'),
end=date_parse('2013-01-01'),
)
# {0: 0.6818022667476219, 1: 0.31819773325237805}
The value is calculated between the start of January 1, 2012 and end of December 31, 2012 (equivalently the start of January 1, 2013) without resampling, and assuming the status is 0 at the start of the year (the default=0 parameter)
Timing results:
In [2]: timeit ts.distribution(
start=date_parse('2012-01-01'),
end=date_parse('2013-01-01')
)
619 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)