Plot scatter for range of dates in matplotlib [duplicate] - python
I am creating a DataFrame from a csv as follows:
stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)
The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?
There are two possible solutions:
1. Use a boolean mask, then use df.loc[mask]
2. Set the date column as a DatetimeIndex, then use df[start_date : end_date]
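The two approaches can be sketched side by side; this is a minimal, self-contained illustration (the column names are my own, not from the question):

```python
import numpy as np
import pandas as pd

# Build a small frame with a date column
df = pd.DataFrame({'value': np.arange(10)})
df['date'] = pd.date_range('2000-01-01', periods=10, freq='D')

# Approach 1: boolean mask on the column
mask = (df['date'] >= '2000-01-03') & (df['date'] <= '2000-01-06')
by_mask = df.loc[mask]

# Approach 2: DatetimeIndex slicing (both endpoints inclusive)
by_index = df.set_index('date').loc['2000-01-03':'2000-01-06']

print(len(by_mask), len(by_index))  # both select 4 rows
```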
Using a boolean mask:
Ensure df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetimes,
np.datetime64s, pd.Timestamps, or even datetime strings:
#greater than the start date and smaller than the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
or re-assign to df
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
Using a DatetimeIndex:
If you are going to do a lot of selections by date, it may be quicker to set the
date column as the index first. Then you can select rows by date using
df.loc[start_date:end_date].
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
While Python list indexing, e.g. seq[start:end], includes start but not end, Pandas df.loc[start_date : end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.
Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
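A short sketch of the parse_dates route (the file is replaced here by an in-memory buffer with made-up data):

```python
import io
import pandas as pd

# An in-memory stand-in for the CSV file (illustrative data)
csv_data = io.StringIO(
    "date,price\n"
    "2000-06-01,10.0\n"
    "2000-06-02,11.5\n"
    "2000-06-03,9.8\n")

# parse_dates converts the column to datetime64 during the read,
# so a separate pd.to_datetime call is unnecessary
df = pd.read_csv(csv_data, parse_dates=['date'])
print(df['date'].dtype)  # datetime64[ns]
```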
I feel the best option is to use direct checks rather than slicing with the loc function:
df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]
It works for me.
The major issue with a loc slice is that, when the index is not sorted, the slice limits must be present in the actual index values; if they are not, this results in a KeyError.
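To make the KeyError condition concrete, a small sketch (my own illustrative data): a sorted DatetimeIndex slices fine even with absent endpoints, while an unsorted one raises:

```python
import pandas as pd

# A sorted DatetimeIndex slices fine even when the slice endpoints are absent
s = pd.Series([1, 2, 3],
              index=pd.to_datetime(['2000-06-02', '2000-06-05', '2000-06-09']))
print(len(s.loc['2000-06-01':'2000-06-10']))  # all 3 rows

# An unsorted index raises KeyError for slice bounds that are not in the index
s_unsorted = s.iloc[[1, 0, 2]]
raised = False
try:
    s_unsorted.loc['2000-06-01':'2000-06-10']
except KeyError:
    raised = True
print(raised)  # True
```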
You can also use between:
df[df.some_date.between(start_date, end_date)]
You can use the isin method on the date column like so:
df[df["date"].isin(pd.date_range(start_date, end_date))]
Note: This only works with dates (as the question asks) and not timestamps.
Example:
import numpy as np
import pandas as pd
# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]
print(in_range_df) # print result
which gives
0 1 2 date
14 0.960974 0.144271 0.839593 2017-01-15
15 0.814376 0.723757 0.047840 2017-01-16
16 0.911854 0.123130 0.120995 2017-01-17
17 0.505804 0.416935 0.928514 2017-01-18
18 0.204869 0.708258 0.170792 2017-01-19
19 0.014389 0.214510 0.045201 2017-01-20
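To illustrate the note above: isin matches exact timestamps, so rows with a time-of-day component will not match the midnight values produced by pd.date_range. One workaround (a sketch, using dt.normalize) is:

```python
import pandas as pd

# One midnight timestamp and one with a time-of-day component
df = pd.DataFrame({'date': pd.to_datetime(['2017-01-15 00:00',
                                           '2017-01-15 08:30'])})

# Only the midnight row matches the midnight values from pd.date_range
hits = df[df['date'].isin(pd.date_range('2017-01-15', '2017-01-20'))]
print(len(hits))  # 1

# Normalizing the timestamps to midnight first restores date-level matching
hits_norm = df[df['date'].dt.normalize().isin(
    pd.date_range('2017-01-15', '2017-01-20'))]
print(len(hits_norm))  # 2
```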
Keeping the solution simple and Pythonic, I would suggest trying this.
If you are going to do this frequently, the best solution is to first set the date column as the index (which converts the column to a DatetimeIndex) and then use a condition like the following to slice any range of dates:
import pandas as pd
data_frame = data_frame.set_index('date')
df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
pandas (as of 0.22) has a between() function, which makes answering this question easier and the code more readable.
# create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})
Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:
# use the between statement to get a boolean mask
df['dates'].between('2018-11-27','2019-01-15', inclusive=False)
0 False
1 False
2 False
3 False
4 False
# you can pass this boolean mask straight to loc
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]
dates
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
335 2018-12-02
Notice the inclusive argument: it is very helpful when you want to be explicit about your range. Note that when it is set to True, we get Nov 27th 2018 back as well:
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
dates
330 2018-11-27
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
This method is also faster than the previously mentioned isin method:
%%timeit -n 5
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
%%timeit -n 5
df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
However, it is not faster than the currently accepted answer (provided by unutbu) if the mask is already created. But if the mask is dynamic and needs to be reassigned over and over, my method may be more efficient:
# create the mask first, THEN time the selection
import datetime as dt
start_date = dt.datetime(2018, 11, 27)
end_date = dt.datetime(2019, 1, 15)
mask = (df['dates'] > start_date) & (df['dates'] <= end_date)
%%timeit -n 5
df.loc[mask]
191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
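A note for current pandas: the True/False form of inclusive shown above was deprecated in pandas 1.3 and removed in 2.0; recent versions take a string instead. A sketch assuming pandas 1.3+:

```python
import pandas as pd

df = pd.DataFrame({'dates': pd.date_range('2018-11-25', '2018-12-01')})

# 'both' includes the endpoints, 'neither' excludes them;
# 'left' and 'right' are also available
both = df['dates'].between('2018-11-27', '2018-11-30', inclusive='both')
neither = df['dates'].between('2018-11-27', '2018-11-30', inclusive='neither')
print(both.sum(), neither.sum())  # 4 2
```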
Another option is to use the pandas.DataFrame.query() method. Let me show you an example on the following data frame, called df.
>>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
>>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
>>> print(df)
col_1 date
0 0.015198 2020-01-01
1 0.638600 2020-01-02
2 0.348485 2020-01-03
3 0.247583 2020-01-04
4 0.581835 2020-01-05
As an argument, use the condition for filtering like this:
>>> start_date, end_date = '2020-01-02', '2020-01-04'
>>> print(df.query('date >= @start_date and date <= @end_date'))
col_1 date
1 0.244104 2020-01-02
2 0.374775 2020-01-03
3 0.510053 2020-01-04
If you do not want to include the boundaries, just change the condition as follows:
>>> print(df.query('date > @start_date and date < @end_date'))
col_1 date
2 0.374775 2020-01-03
You can use the method truncate:
dates = pd.date_range('2016-01-01', '2016-01-06', freq='d')
df = pd.DataFrame(index=dates, data={'A': 1})
A
2016-01-01 1
2016-01-02 1
2016-01-03 1
2016-01-04 1
2016-01-05 1
2016-01-06 1
Select data between two dates:
df.truncate(before=pd.Timestamp('2016-01-02'),
            after=pd.Timestamp('2016-01-04'))
Output:
A
2016-01-02 1
2016-01-03 1
2016-01-04 1
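One caveat worth knowing: truncate requires a sorted index and raises a ValueError otherwise, so sort first if needed. A minimal sketch (my own toy data):

```python
import pandas as pd

# An out-of-order DatetimeIndex
dates = pd.to_datetime(['2016-01-03', '2016-01-01', '2016-01-02'])
df = pd.DataFrame({'A': 1}, index=dates)

# truncate insists on a sorted index, so sort it first
df = df.sort_index()
out = df.truncate(before=pd.Timestamp('2016-01-02'),
                  after=pd.Timestamp('2016-01-03'))
print(len(out))  # 2
```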
It is highly recommended to convert the date column to an index; doing so brings many conveniences. One is the ability to select rows between two dates easily, as in this example:
import numpy as np
import pandas as pd
# Dataframe with monthly data between 2016 - 2020
df = pd.DataFrame(np.random.random((60, 3)))
df['date'] = pd.date_range('2016-1-1', periods=60, freq='M')
To select the rows between 2017-01-01 and 2019-01-01, you need only convert the date column to an index:
df.set_index('date', inplace=True)
and then slice. Note that partial strings select whole periods, so df.loc['2017':'2019'] returns everything from the start of 2017 through the end of 2019; use full dates for an exact upper bound:
df.loc['2017-01-01':'2019-01-01']
You can set the date column as the index directly while reading the csv file, instead of using df.set_index() (parse_dates is needed so the index is parsed as dates rather than strings):
df = pd.read_csv('file_name.csv', index_col='date', parse_dates=True)
I prefer not to alter the df.
An option is to retrieve the index of the start and end dates:
import numpy as np
import pandas as pd
#Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
#Get the index of the start and end dates respectively
start = df[df['date']=='2017-01-07'].index[0]
end = df[df['date']=='2017-01-14'].index[0]
#Show the sliced df (from 2017-01-07 to 2017-01-14)
df.loc[start:end]
which results in:
0 1 2 date
6 0.5 0.8 0.8 2017-01-07
7 0.0 0.7 0.3 2017-01-08
8 0.8 0.9 0.0 2017-01-09
9 0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14
Inspired by unutbu:
print(df.dtypes)  # check the dtypes first; the date column should be datetime64, not object, for slicing to work
columnName = 'YourColumnName'
df[columnName + 'index'] = df[columnName]  # copy the column so it also survives as data
df.set_index(columnName + 'index', inplace=True)  # build the index on the timestamps/dates
df.loc['2020-09-03 01:00':'2020-09-06']  # select a range from the index; this is your new DataFrame
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000],
'Duration':['30days','50days','55days','40days','60days','35days','55days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400],
'InsertedDates':["2021-11-14","2021-11-15","2021-11-16","2021-11-17","2021-11-18","2021-11-19","2021-11-20"]
})
df = pd.DataFrame(technologies)
print(df)
Using pandas.DataFrame.loc to Filter Rows by Dates
Method 1:
start_date = '2021-11-15'
end_date = '2021-11-19'
mask = (df['InsertedDates'] > start_date) & (df['InsertedDates'] <= end_date)
df2 = df.loc[mask]
print(df2)
Method 2:
start_date = '2021-11-15'
end_date = '2021-11-19'
after_start_date = df["InsertedDates"] >= start_date
before_end_date = df["InsertedDates"] <= end_date
between_two_dates = after_start_date & before_end_date
df2 = df.loc[between_two_dates]
print(df2)
Using pandas.DataFrame.query() to select DataFrame Rows
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates >= @start_date and InsertedDates <= @end_date')
print(df2)
Select rows between two dates using DataFrame.query()
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates > @start_date and InsertedDates < @end_date')
print(df2)
Using the pandas.Series.between() function with two dates:
df2 = df.loc[df["InsertedDates"].between("2021-11-16", "2021-11-18")]
print(df2)
Select DataFrame rows between two dates using DataFrame.isin()
df2 = df[df["InsertedDates"].isin(pd.date_range("2021-11-15", "2021-11-17"))]
print(df2)
You can do it with pd.date_range() and Timestamp. Let's say you have read a csv file with a date column using the parse_dates option:
df = pd.read_csv('my_file.csv', parse_dates=['my_date_col'])
Then you can define a date-range index:
rge = pd.date_range(end='15/6/2020', periods=2)
and then filter your values by date with a map:
df.loc[df['my_date_col'].map(lambda row: row.date() in rge)]
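The row-by-row map works, but is slow on large frames; a vectorized equivalent (a sketch of my own using dt.normalize, not the original answer's code) would be:

```python
import pandas as pd

# Hypothetical frame standing in for the parsed CSV
df = pd.DataFrame({'my_date_col': pd.to_datetime(
    ['2020-06-13 09:00', '2020-06-14 17:30', '2020-06-15 08:15'])})

rge = pd.date_range(end='2020-06-15', periods=2)  # 2020-06-14 and 2020-06-15

# Normalize to midnight, then test membership in one vectorized pass
out = df[df['my_date_col'].dt.normalize().isin(rge)]
print(len(out))  # 2
```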
Related
How to set values in dataframe to a value before every date in every year [duplicate]
I've got some daily data in a Pandas DataFrame and it has a nice index. Something like this:
import pandas as pd
import numpy as np
rng = pd.date_range('1/1/2010', periods=1000, freq='D')
ts = pd.DataFrame(np.random.randn(len(rng)), index=rng, columns=['vals'])
print ts.head()
 vals
2010-01-01 1.098302
2010-01-02 -1.384821
2010-01-03 -0.426329
2010-01-04 -0.587967
2010-01-05 -0.853374
I'd like to subset my DataFrame to only the records that fall between February 2 & March 3 for all years. It seems there should be a very Pandas-esque way of doing this, but I'm struggling to find it. Any help?
I don't think there is a native way to do this (there is with between_time). But you can do it naively (this will be efficient, but is a pain to write!):
In [11]: ts[((ts.index.month == 2) & (2 <= ts.index.day)    # in Feb, after the 2nd inclusive
     | (ts.index.month == 3) & (ts.index.day <= 3))]        # in March, before the 3rd inclusive
Out[11]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 94 entries, 2010-02-01 00:00:00 to 2012-03-03 00:00:00
Data columns (total 1 columns):
vals 94 non-null values
dtypes: float64(1)
To select all records of an annually recurring period covering multiple months, do as follows:
rng = pd.date_range('2010-1-1', periods=1000, freq='D')
df = pd.DataFrame(np.random.randn(len(rng)), index=rng, columns=['A'])
startMM, startdd = (2, 15)   # Feb 15th
endMM, enddd = (10, 3)       # Oct 3rd
month_day = pd.concat([
    df.index.to_series().dt.month,
    df.index.to_series().dt.day
], axis=1).apply(tuple, axis=1)
df[(month_day >= (startMM, startdd)) & (month_day <= (endMM, enddd))]
as mentioned by @IanS in https://stackoverflow.com/a/45996897/2459096
Filter data by Year and month range [duplicate]
Append values in pandas where value equals other value
I have two data frames:
dfi = pd.read_csv('C:/Users/Mauricio/Desktop/inflation.csv')
dfm = pd.read_csv('C:/Users/Mauricio/Desktop/maturity.csv')
# equals the following
  observation_date CPIAUCSL
0 1947-01-01 21.48
1 1947-02-01 21.62
2 1947-03-01 22.00
3 1947-04-01 22.00
4 1947-05-01 21.95
  observation_date DGS10
0 1962-01-02 4.06
1 1962-01-03 4.03
2 1962-01-04 3.99
3 1962-01-05 4.02
4 1962-01-08 4.03
I created a copy as df by doing the following:
df = dfi.copy(deep=True)
which returns an exact copy of dfi. dfi dates go by month and dfm dates go by day. I want to create a new column in df so that every time a date in dfi == a date in dfm, the DGS10 value is appended. I have this so far:
for date in df.observation_date:
    for date2 in dfm.observation_date:
        if date == date2:
            df['mat_rate'] = dfm['DGS10']
# this is what I get, but dates do not match values
  observation_date CPIAUCSL mat_rate
0 1947-01-01 21.48 4.06
1 1947-02-01 21.62 4.03
2 1947-03-01 22.00 3.99
3 1947-04-01 22.00 4.02
4 1947-05-01 21.95 4.03
It runs, but it does not append only where date == date2. What can I do so it appends the values where date equals date2 only? Thank you!
If the date formats are inconsistent, convert them first:
dfi.observation_date = pd.to_datetime(dfi.observation_date, format='%Y-%m-%d')
dfm.observation_date = pd.to_datetime(dfm.observation_date, format='%Y-%m-%d')
Now, getting your result should be easy with a merge:
df = dfi.merge(dfm, on='observation_date')
Finding time difference between two columns in DataFrame [duplicate]
I am trying to find the time difference between two columns of the following frame:
Test Date | Test Type | First Use Date
I used the following function definition to get the difference:
def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)
It works fine, but it does not take a Series as input, so I had to construct a for loop over the indices:
age_veh = []
for i in range(0, len(data_manufacturer)-1):
    age_veh[i].append(days_between(data_manufacturer.iloc[i,0], data_manufacturer.iloc[i,4]))
However, it returns an error:
IndexError: list index out of range
I don't know whether this is the right way of doing it or what I am doing wrong; an alternative solution would be much appreciated. Please also bear in mind that I have around 2 million rows.
Convert the columns using to_datetime; then you can subtract the columns to produce a timedelta, take the absolute values, and call dt.days to get the total number of days. Example:
In [119]:
import io
import pandas as pd
t = """Test Date,Test Type,First Use Date
2011-02-05,A,2010-01-05
2012-02-05,A,2010-03-05
2013-02-05,A,2010-06-05
2014-02-05,A,2010-08-05"""
df = pd.read_csv(io.StringIO(t))
df
Out[119]:
  Test Date Test Type First Use Date
0 2011-02-05 A 2010-01-05
1 2012-02-05 A 2010-03-05
2 2013-02-05 A 2010-06-05
3 2014-02-05 A 2010-08-05
In [121]:
df['Test Date'] = pd.to_datetime(df['Test Date'])
df['First Use Date'] = pd.to_datetime(df['First Use Date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
Test Date 4 non-null datetime64[ns]
Test Type 4 non-null object
First Use Date 4 non-null datetime64[ns]
dtypes: datetime64[ns](2), object(1)
memory usage: 128.0+ bytes
In [122]:
df['days'] = (df['Test Date'] - df['First Use Date']).abs().dt.days
df
Out[122]:
  Test Date Test Type First Use Date days
0 2011-02-05 A 2010-01-05 396
1 2012-02-05 A 2010-03-05 702
2 2013-02-05 A 2010-06-05 976
3 2014-02-05 A 2010-08-05 1280
IIUC you can first convert the columns with to_datetime, use abs, and then convert the timedelta to days:
print df
  id value date1 date2 sum
0 A 150 2014-04-08 2014-03-08 NaN
1 B 100 2014-05-08 2014-02-08 NaN
2 B 200 2014-01-08 2014-07-08 100
3 A 200 2014-04-08 2014-03-08 NaN
4 A 300 2014-06-08 2014-04-08 350
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
df['diff'] = (df['date1'] - df['date2']).abs() / np.timedelta64(1, 'D')
print df
  id value date1 date2 sum diff
0 A 150 2014-04-08 2014-03-08 NaN 31
1 B 100 2014-05-08 2014-02-08 NaN 89
2 B 200 2014-01-08 2014-07-08 100 181
3 A 200 2014-04-08 2014-03-08 NaN 31
4 A 300 2014-06-08 2014-04-08 350 61
EDIT: For larger DataFrames it is better to use np.timedelta64(1, 'D') for converting to days, because it is faster. I use EdChum's sample, only with len(df) = 4k:
import io
import pandas as pd
import numpy as np
t = u"""Test Date,Test Type,First Use Date
2011-02-05,A,2010-01-05
2012-02-05,A,2010-03-05
2013-02-05,A,2010-06-05
2014-02-05,A,2010-08-05"""
df = pd.read_csv(io.StringIO(t))
df = pd.concat([df]*1000).reset_index(drop=True)
df['Test Date'] = pd.to_datetime(df['Test Date'])
df['First Use Date'] = pd.to_datetime(df['First Use Date'])
print (df['Test Date'] - df['First Use Date']).abs().dt.days
print (df['Test Date'] - df['First Use Date']).abs() / np.timedelta64(1, 'D')
Timings:
In [174]: %timeit (df['Test Date'] - df['First Use Date']).abs().dt.days
10 loops, best of 3: 38.8 ms per loop
In [175]: %timeit (df['Test Date'] - df['First Use Date']).abs() / np.timedelta64(1, 'D')
1000 loops, best of 3: 1.62 ms per loop
Calculate time in certain state for time series data
I have an irregularly indexed time series of data with seconds resolution, like:
import pandas as pd
idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43', '2012-09-26 18:35:11', '2012-11-11 2:34:59']
status = [1, 0, 1, 0]
df = pd.DataFrame(status, index=idx, columns=['status'])
df = df.reindex(pd.to_datetime(df.index))
In [62]: df
Out[62]:
  status
2012-01-01 12:43:35 1
2012-03-12 15:46:43 0
2012-09-26 18:35:11 1
2012-11-11 02:34:59 0
and I am interested in the fraction of the year when the status is 1. The way I currently do it is to reindex df with every second in the year and forward-fill:
full_idx = pd.date_range(start='1/1/2012', end='12/31/2012', freq='s')
df1 = df.reindex(full_idx, method='ffill')
which returns a DataFrame that contains every second of the year, from which I can then take the mean to get the fraction of time in status 1:
In [66]: df1
Out[66]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 31536001 entries, 2012-01-01 00:00:00 to 2012-12-31 00:00:00
Freq: S
Data columns:
status 31490186 non-null values
dtypes: float64(1)
In [67]: df1.status.mean()
Out[67]: 0.31953371123308066
The problem is that I have to do this for a lot of data, and reindexing for every second of the year is by far the most expensive operation. What are better ways to do this?
There doesn't seem to be a pandas method to compute time differences between entries of an irregular time series, though there is a convenience method to convert a time series index to an array of datetime.datetime objects, which can be converted to datetime.timedelta objects through subtraction.
In [6]: start_end = pd.DataFrame({'status': [0, 0]}, index=[pd.datetools.parse('1/1/2012'), pd.datetools.parse('12/31/2012')])
In [7]: df = df.append(start_end).sort()
In [8]: df
Out[8]:
  status
2012-01-01 00:00:00 0
2012-01-01 12:43:35 1
2012-03-12 15:46:43 0
2012-09-26 18:35:11 1
2012-11-11 02:34:59 0
2012-12-31 00:00:00 0
In [9]: pydatetime = pd.Series(df.index.to_pydatetime(), index=df.index)
In [11]: df['duration'] = pydatetime.diff().shift(-1).\
    map(datetime.timedelta.total_seconds, na_action='ignore')
In [16]: df
Out[16]:
  status duration
2012-01-01 00:00:00 0 45815
2012-01-01 12:43:35 1 6145388
2012-03-12 15:46:43 0 17117308
2012-09-26 18:35:11 1 3916788
2012-11-11 02:34:59 0 4310701
2012-12-31 00:00:00 0 NaN
In [17]: (df.status * df.duration).sum() / df.duration.sum()
Out[17]: 0.31906950786402843
Note: Our answers seem to differ because I set the status before the first timestamp to zero, while those entries are NA in your df1 (there is no start value to forward-fill, and NA values are excluded by pandas mean()). timedelta.total_seconds() is new in Python 2.7.
Timing comparison of this method versus reindexing:
In [8]: timeit delta_method(df)
1000 loops, best of 3: 1.3 ms per loop
In [9]: timeit redindexing(df)
1 loops, best of 3: 2.78 s per loop
Another potential approach is to use traces.
import traces
from dateutil.parser import parse as date_parse
idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43', '2012-09-26 18:35:11', '2012-11-11 2:34:59']
status = [1, 0, 1, 0]
# create a TimeSeries from date strings and status
ts = traces.TimeSeries(default=0)
for date_string, status_value in zip(idx, status):
    ts[date_parse(date_string)] = status_value
# compute distribution
ts.distribution(
    start=date_parse('2012-01-01'),
    end=date_parse('2013-01-01'),
)
# {0: 0.6818022667476219, 1: 0.31819773325237805}
The value is calculated between the start of January 1, 2012 and the end of December 31, 2012 (equivalently, the start of January 1, 2013) without resampling, assuming the status is 0 at the start of the year (the default=0 parameter).
Timing results:
In [2]: timeit ts.distribution(start=date_parse('2012-01-01'), end=date_parse('2013-01-01'))
619 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)