Plot scatter for range of dates in matplotlib [duplicate] - python

I am creating a DataFrame from a csv as follows:
stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)
The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?

There are two possible solutions:
Use a boolean mask, then use df.loc[mask]
Set the date column as a DatetimeIndex, then use df[start_date : end_date]
Using a boolean mask:
Ensure df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetimes,
np.datetime64s, pd.Timestamps, or even datetime strings:
# later than the start date and no later than the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
or re-assign to df
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
Using a DatetimeIndex:
If you are going to do a lot of selections by date, it may be quicker to set the
date column as the index first. Then you can select rows by date using
df.loc[start_date:end_date].
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
Python list indexing, e.g. seq[start:end], includes start but not end. In contrast, pandas df.loc[start_date:end_date] includes both endpoints in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.
Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
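As a minimal sketch of the parse_dates route combined with the boolean mask, the snippet below reads a small in-memory CSV. The CSV text, the column name close and the values are made up for illustration; they stand in for the file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file on disk.
csv_text = """date, close
2000-05-30, 10.0
2000-06-02, 11.5
2000-06-10, 12.1
2000-06-15, 12.8
"""

# parse_dates converts the 'date' column to datetime64 while reading,
# so no separate pd.to_datetime call is needed afterwards.
stock = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"],
                    skipinitialspace=True)

start_date, end_date = "2000-06-01", "2000-06-10"
mask = (stock["date"] > start_date) & (stock["date"] <= end_date)
selected = stock.loc[mask]
print(selected)
```

Only the rows dated 2000-06-02 and 2000-06-10 fall inside the half-open range used by the mask.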

I feel the best option is to use direct comparisons rather than slicing with loc:
df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]
It works for me.
A major issue with slicing via loc is that, when the index is not sorted, the limits have to be present among the actual index values; if they are not, the result is a KeyError.
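One way to sidestep the problem with missing limits is to sort the index before slicing. The following is a small sketch under that assumption; the frame and the name df_unsorted are hypothetical.

```python
import numpy as np
import pandas as pd

# A DatetimeIndex that is deliberately out of order.
dates = pd.to_datetime(["2000-06-09", "2000-06-03", "2000-06-12", "2000-06-05"])
df_unsorted = pd.DataFrame({"value": np.arange(4)}, index=dates)

# Sort first, so that label slicing is well defined even when the
# bounds themselves do not appear in the index.
df_sorted = df_unsorted.sort_index()
subset = df_sorted.loc["2000-06-04":"2000-06-10"]
print(subset)
```

Neither 2000-06-04 nor 2000-06-10 is in the index, yet the slice returns the two rows that lie between them.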

You can also use between:
df[df.some_date.between(start_date, end_date)]

You can use the isin method on the date column like so
df[df["date"].isin(pd.date_range(start_date, end_date))]
Note: This only works with dates (as the question asks) and not timestamps.
Example:
import numpy as np
import pandas as pd
# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]
print(in_range_df) # print result
which gives
0 1 2 date
14 0.960974 0.144271 0.839593 2017-01-15
15 0.814376 0.723757 0.047840 2017-01-16
16 0.911854 0.123130 0.120995 2017-01-17
17 0.505804 0.416935 0.928514 2017-01-18
18 0.204869 0.708258 0.170792 2017-01-19
19 0.014389 0.214510 0.045201 2017-01-20
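If the column holds timestamps with a time-of-day component, one possible workaround is to normalize them to midnight before the isin test. This is a sketch with made-up data; df_ts is a hypothetical frame.

```python
import pandas as pd

df_ts = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-01-14 23:59:00",
        "2017-01-15 08:30:00",
        "2017-01-20 17:45:00",
        "2017-01-21 00:00:01",
    ])
})

days = pd.date_range("2017-01-15", "2017-01-20")

# Plain isin compares full timestamps, so rows with a time component are missed.
exact_matches = df_ts[df_ts["date"].isin(days)]

# dt.normalize() sets the time to midnight before the membership test.
by_day = df_ts[df_ts["date"].dt.normalize().isin(days)]
print(len(exact_matches), len(by_day))
```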

To keep the solution simple and pythonic, I would suggest trying this.
If you are going to do this frequently, the best solution is to first set the date column as the index, which turns it into a DatetimeIndex, and then use the following condition to select any range of dates.
import pandas as pd
data_frame = data_frame.set_index('date')
df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]

pandas 0.22 has a between() function.
It makes answering this question easier and the code more readable.
# create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})
Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:
# use the between statement to get a boolean mask
df['dates'].between('2018-11-27','2019-01-15', inclusive=False)
0 False
1 False
2 False
3 False
4 False
# you can pass this boolean mask straight to loc
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]
dates
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
335 2018-12-02
Notice the inclusive argument, which is very helpful when you want to be explicit about your range. When it is set to True, Nov 27th 2018 is returned as well:
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
dates
330 2018-11-27
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
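In newer pandas releases (1.3 and later), the inclusive argument of between takes the strings "both", "neither", "left" or "right" instead of a boolean. A short sketch, assuming such a version is installed:

```python
import pandas as pd

df_dates = pd.DataFrame({"dates": pd.date_range("2018-01-01", "2019-01-01")})
col = df_dates["dates"]

# "both" keeps both endpoints, "neither" drops both,
# "left" keeps only the start date.
both = df_dates[col.between("2018-11-27", "2018-12-01", inclusive="both")]
neither = df_dates[col.between("2018-11-27", "2018-12-01", inclusive="neither")]
left = df_dates[col.between("2018-11-27", "2018-12-01", inclusive="left")]
print(len(both), len(neither), len(left))
```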
This method is also faster than the previously mentioned isin method:
%%timeit -n 5
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
%%timeit -n 5
df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
However, it is not faster than the currently accepted answer, provided by unutbu, when the mask has already been created. If the mask is dynamic and needs to be rebuilt over and over, my method may be more efficient:
# create the mask first, THEN time the selection
import datetime as dt
start_date = dt.datetime(2018,11,27)
end_date = dt.datetime(2019,1,15)
mask = (df['dates'] > start_date) & (df['dates'] <= end_date)
%%timeit -n 5
df.loc[mask]
191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

Another way to achieve this is the pandas.DataFrame.query() method. Here is an example on the following data frame called df.
>>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
>>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
>>> print(df)
col_1 date
0 0.015198 2020-01-01
1 0.638600 2020-01-02
2 0.348485 2020-01-03
3 0.247583 2020-01-04
4 0.581835 2020-01-05
Pass the filter condition as a string argument; the @ prefix refers to Python variables:
>>> start_date, end_date = '2020-01-02', '2020-01-04'
>>> print(df.query('date >= @start_date and date <= @end_date'))
col_1 date
1 0.244104 2020-01-02
2 0.374775 2020-01-03
3 0.510053 2020-01-04
If you do not want to include the boundaries, just change the condition as follows:
>>> print(df.query('date > @start_date and date < @end_date'))
col_1 date
2 0.374775 2020-01-03
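The query approach can be put together into a self-contained sketch. Inside the query string, the @ prefix looks up a Python variable from the surrounding scope. The frame df_q and its values are made up for illustration.

```python
import pandas as pd

df_q = pd.DataFrame({
    "col_1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "date": pd.date_range("2020-01-01", periods=5, freq="D"),
})

start_date = pd.Timestamp("2020-01-02")
end_date = pd.Timestamp("2020-01-04")

# Bounds included.
inclusive_rows = df_q.query("date >= @start_date and date <= @end_date")
# Bounds excluded.
exclusive_rows = df_q.query("date > @start_date and date < @end_date")
print(inclusive_rows)
print(exclusive_rows)
```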

You can use the method truncate:
dates = pd.date_range('2016-01-01', '2016-01-06', freq='d')
df = pd.DataFrame(index=dates, data={'A': 1})
A
2016-01-01 1
2016-01-02 1
2016-01-03 1
2016-01-04 1
2016-01-05 1
2016-01-06 1
Select data between two dates:
df.truncate(before=pd.Timestamp('2016-01-02'),
after=pd.Timestamp('2016-01-4'))
Output:
A
2016-01-02 1
2016-01-03 1
2016-01-04 1

It is highly recommended to convert the date column to an index. Doing so gives you a lot of conveniences, one of which is selecting the rows between two dates easily, as in this example:
import numpy as np
import pandas as pd
# Dataframe with monthly data between 2016 - 2020
df = pd.DataFrame(np.random.random((60, 3)))
df['date'] = pd.date_range('2016-1-1', periods=60, freq='M')
To select the rows from 2017 through the end of 2019, you only need to convert the date column to an index:
df.set_index('date', inplace=True)
and then slice:
df.loc['2017':'2019']
You can also make the date column the index while reading the csv file, instead of calling df.set_index() afterwards (parse_dates=True parses the index as dates):
df = pd.read_csv('file_name.csv', index_col='date', parse_dates=True)
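Slicing a DatetimeIndex with partial date strings selects whole periods, so the end label covers the entire year or month it names. A small sketch with made-up monthly data (df_m is hypothetical, and month-start dates are assumed for simplicity):

```python
import numpy as np
import pandas as pd

# Month-start dates from 2016-01 through 2020-12.
idx = pd.date_range("2016-01-01", periods=60, freq="MS")
df_m = pd.DataFrame({"value": np.arange(60)}, index=idx)

years = df_m.loc["2017":"2019"]         # all of 2017, 2018 and 2019
months = df_m.loc["2018-03":"2018-05"]  # March through May 2018
print(len(years), len(months))
```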

I prefer not to alter the df.
An option is to retrieve the index of the start and end dates:
import numpy as np
import pandas as pd
#Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
#Get the index of the start and end dates respectively
start = df[df['date']=='2017-01-07'].index[0]
end = df[df['date']=='2017-01-14'].index[0]
#Show the sliced df (from 2017-01-07 to 2017-01-14)
df.loc[start:end]
which results in:
0 1 2 date
6 0.5 0.8 0.8 2017-01-07
7 0.0 0.7 0.3 2017-01-08
8 0.8 0.9 0.0 2017-01-09
9 0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14

Inspired by unutbu
print(df.dtypes) # Check the column dtype. After the column becomes the index it no longer shows up here.
columnName = 'YourColumnName'
df[columnName] = pd.to_datetime(df[columnName]) # If the dtype is 'object', convert it so the slice compares dates, not strings
df[columnName+'index'] = df[columnName] # Create a new column for the index
df.set_index(columnName+'index', inplace=True) # Build the index on the timestamps/dates
df.loc['2020-09-03 01:00':'2020-09-06'] # Select the range from the index. This is your new DataFrame.

import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000],
'Duration':['30days','50days','55days','40days','60days','35days','55days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400],
'InsertedDates':["2021-11-14","2021-11-15","2021-11-16","2021-11-17","2021-11-18","2021-11-19","2021-11-20"]
})
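df = pd.DataFrame(technologies)
df['InsertedDates'] = pd.to_datetime(df['InsertedDates']) # convert the date strings to datetime64
print(df)
Using pandas.DataFrame.loc to Filter Rows by Dates
Method 1:
start_date = '2021-11-15'
end_date = '2021-11-19'
mask = (df['InsertedDates'] > start_date) & (df['InsertedDates'] <= end_date)
df2 = df.loc[mask]
print(df2)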
Method 2:
start_date = '2021-11-15'
end_date = '2021-11-19'
after_start_date = df["InsertedDates"] >= start_date
before_end_date = df["InsertedDates"] <= end_date
between_two_dates = after_start_date & before_end_date
df2 = df.loc[between_two_dates]
print(df2)
Using pandas.DataFrame.query() to select DataFrame Rows
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates >= @start_date and InsertedDates <= @end_date')
print(df2)
Select rows between two dates using DataFrame.query()
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates > @start_date and InsertedDates < @end_date')
print(df2)
pandas.Series.between() function Using two dates
df2 = df.loc[df["InsertedDates"].between("2021-11-16", "2021-11-18")]
print(df2)
Select DataFrame rows between two dates using DataFrame.isin()
df2 = df[df["InsertedDates"].isin(pd.date_range("2021-11-15", "2021-11-17"))]
print(df2)

You can do it with pd.date_range() and Timestamp.
Let's say you have read a csv file with a date column using the parse_dates option:
df = pd.read_csv('my_file.csv', parse_dates=['my_date_col'])
Then you can define a date range index (here the two days ending on 15 June 2020):
rge = pd.date_range(end='2020-06-15', periods=2)
and then filter your values by date with a map, normalizing each timestamp to midnight so that it can match an entry of the range:
df.loc[df['my_date_col'].map(lambda ts: ts.normalize() in rge)]

Related

How to set values in dataframe to a value before every date in every year [duplicate]

I've got some daily data in a Pandas DataFrame and it has a nice index. Something like this:
import pandas as pd
import numpy as np
rng = pd.date_range('1/1/2010', periods=1000, freq='D')
ts = pd.DataFrame(np.random.randn(len(rng)), index=rng, columns=['vals'])
print(ts.head())
vals
2010-01-01 1.098302
2010-01-02 -1.384821
2010-01-03 -0.426329
2010-01-04 -0.587967
2010-01-05 -0.853374
I'd like to subset my DataFrame to only the records that fall between February 2 & March 3 for all years.
It seems there should be a very Pandas-esque way of doing this but I'm struggling to find it. Any help?
I don't think there is a native way to do this (there is one for times of day, with between_time).
But you can do it naively (this will be efficient, but is a pain to write!):
In [11]: ts[((ts.index.month == 2) & (2 <= ts.index.day) # in Feb after the 2nd inclusive
| (ts.index.month == 3) & (ts.index.day <= 3))]  # in March before the 3rd inclusive
Out[11]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 94 entries, 2010-02-01 00:00:00 to 2012-03-03 00:00:00
Data columns (total 1 columns):
vals 94 non-null values
dtypes: float64(1)
To select all records of a period that recurs every year and spans multiple months, do the following:
rng = pd.date_range('2010-1-1', periods=1000, freq='D')
df = pd.DataFrame(np.random.randn(len(rng)), index=rng, columns=['A'])
startMM, startdd = (2,15) # Feb 15th
endMM, enddd = (10,3) # Oct 3rd
month_day = pd.concat([
df.index.to_series().dt.month,
df.index.to_series().dt.day
], axis=1).apply(tuple, axis=1)
df[(month_day >= (startMM, startdd)) & (month_day <= (endMM, enddd))]
as mentioned by @IanS in https://stackoverflow.com/a/45996897/2459096
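A possible vectorized variant of the same idea encodes month and day as a single integer, which avoids building tuples row by row. This is a sketch, not the method from the answers above, and it assumes the period does not wrap around the end of the year.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2010-01-01", periods=1000, freq="D")
ts = pd.DataFrame({"vals": np.random.randn(len(rng))}, index=rng)

# Encode (month, day) as one integer: Feb 2 -> 202, Mar 3 -> 303.
month_day = ts.index.month * 100 + ts.index.day

# Keep Feb 2 through Mar 3 (inclusive) of every year.
subset = ts[(month_day >= 202) & (month_day <= 303)]
print(len(subset))
```

For a period that crosses New Year (say Dec 15 to Jan 10), the two comparisons would have to be combined with | instead of &.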

Append values in pandas where value equals other value

I have two data frames:
dfi = pd.read_csv('C:/Users/Mauricio/Desktop/inflation.csv')
dfm = pd.read_csv('C:/Users/Mauricio/Desktop/maturity.csv')
# equals the following
observation_date CPIAUCSL
0 1947-01-01 21.48
1 1947-02-01 21.62
2 1947-03-01 22.00
3 1947-04-01 22.00
4 1947-05-01 21.95
observation_date DGS10
0 1962-01-02 4.06
1 1962-01-03 4.03
2 1962-01-04 3.99
3 1962-01-05 4.02
4 1962-01-08 4.03
I created a copy as df doing the following:
df = dfi.copy(deep=True)
This returns an exact copy of dfi. The dfi dates go by month and the dfm dates go by day. I want to create a new column in df that, every time a date in dfi equals a date in dfm, holds the corresponding DGS10 value.
I have this so far:
for date in df.observation_date:
    for date2 in dfm.observation_date:
        if date==date2:
            df['mat_rate'] = dfm['DGS10']
# this is what I get but dates do not match values
observation_date CPIAUCSL mat_rate
0 1947-01-01 21.48 4.06
1 1947-02-01 21.62 4.03
2 1947-03-01 22.00 3.99
3 1947-04-01 22.00 4.02
4 1947-05-01 21.95 4.03
It runs, but it does not pick the values where date == date2. What can I do so that only the values where date equals date2 are appended?
Thank you!
If the date formats are inconsistent, convert them first:
dfi.observation_date = pd.to_datetime(dfi.observation_date, format='%Y-%m-%d')
dfm.observation_date = pd.to_datetime(dfm.observation_date, format='%Y-%m-%d')
Now, getting your result should be easy with a merge:
df = dfi.merge(dfm, on='observation_date')
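Note that the default merge is an inner join, so monthly dates with no matching daily observation are dropped. If every row of dfi should be kept, a left join is one option. The sketch below uses small made-up frames; the numbers are illustrative, not real CPI or Treasury data.

```python
import pandas as pd

# Illustrative values only.
dfi = pd.DataFrame({
    "observation_date": pd.to_datetime(["1962-01-01", "1962-02-01", "1962-03-01"]),
    "CPIAUCSL": [1.0, 2.0, 3.0],
})
dfm = pd.DataFrame({
    "observation_date": pd.to_datetime(["1962-01-02", "1962-02-01", "1962-03-01"]),
    "DGS10": [10.0, 20.0, 30.0],
})

# how="left" keeps all rows of dfi; dates without a match get NaN.
df = dfi.merge(
    dfm.rename(columns={"DGS10": "mat_rate"}),
    on="observation_date",
    how="left",
)
print(df)
```

Here 1962-01-01 has no counterpart in dfm, so its mat_rate is NaN, while the other two rows receive the matching value.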

Finding time difference between two columns in DataFrame [duplicate]

I am trying to find the time difference between two columns of the following frame:
Test Date | Test Type | First Use Date
I used the following function definition to get the difference:
def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)
And it works fine, however it does not take a series as an input. So I had to construct a for loop that loops over indices:
age_veh = []
for i in range(0, len(data_manufacturer)-1):
    age_veh[i].append(days_between(data_manufacturer.iloc[i,0], data_manufacturer.iloc[i,4]))
However, it does return an error:
IndexError: list index out of range
I don't know whether this is the right way of doing it. An explanation of what I am doing wrong, or an alternative solution, would be much appreciated. Please also bear in mind that I have around 2 million rows.
Convert the columns using to_datetime; you can then subtract the columns to produce a timedelta, take its absolute value with abs, and call dt.days to get the total number of days. Example:
In [119]:
import io
import pandas as pd
t="""Test Date,Test Type,First Use Date
2011-02-05,A,2010-01-05
2012-02-05,A,2010-03-05
2013-02-05,A,2010-06-05
2014-02-05,A,2010-08-05"""
df = pd.read_csv(io.StringIO(t))
df
Out[119]:
Test Date Test Type First Use Date
0 2011-02-05 A 2010-01-05
1 2012-02-05 A 2010-03-05
2 2013-02-05 A 2010-06-05
3 2014-02-05 A 2010-08-05
In [121]:
df['Test Date'] = pd.to_datetime(df['Test Date'])
df['First Use Date'] = pd.to_datetime(df['First Use Date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
Test Date 4 non-null datetime64[ns]
Test Type 4 non-null object
First Use Date 4 non-null datetime64[ns]
dtypes: datetime64[ns](2), object(1)
memory usage: 128.0+ bytes
In [122]:
df['days'] = (df['Test Date'] - df['First Use Date']).abs().dt.days
df
Out[122]:
Test Date Test Type First Use Date days
0 2011-02-05 A 2010-01-05 396
1 2012-02-05 A 2010-03-05 702
2 2013-02-05 A 2010-06-05 976
3 2014-02-05 A 2010-08-05 1280
IIUC you can first convert columns to_datetime, use abs and then convert timedelta to days:
print(df)
id value date1 date2 sum
0 A 150 2014-04-08 2014-03-08 NaN
1 B 100 2014-05-08 2014-02-08 NaN
2 B 200 2014-01-08 2014-07-08 100
3 A 200 2014-04-08 2014-03-08 NaN
4 A 300 2014-06-08 2014-04-08 350
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
df['diff'] = (df['date1'] - df['date2']).abs() / np.timedelta64(1, 'D')
print(df)
id value date1 date2 sum diff
0 A 150 2014-04-08 2014-03-08 NaN 31
1 B 100 2014-05-08 2014-02-08 NaN 89
2 B 200 2014-01-08 2014-07-08 100 181
3 A 200 2014-04-08 2014-03-08 NaN 31
4 A 300 2014-06-08 2014-04-08 350 61
EDIT:
I think it is better to divide by np.timedelta64(1, 'D') to convert to days in larger DataFrames, because it is faster.
I use EdChum's sample, only with len(df) = 4k:
import io
import pandas as pd
import numpy as np
t=u"""Test Date,Test Type,First Use Date
2011-02-05,A,2010-01-05
2012-02-05,A,2010-03-05
2013-02-05,A,2010-06-05
2014-02-05,A,2010-08-05"""
df = pd.read_csv(io.StringIO(t))
df = pd.concat([df]*1000).reset_index(drop=True)
df['Test Date'] = pd.to_datetime(df['Test Date'])
df['First Use Date'] = pd.to_datetime(df['First Use Date'])
print((df['Test Date'] - df['First Use Date']).abs().dt.days)
print((df['Test Date'] - df['First Use Date']).abs() / np.timedelta64(1, 'D'))
Timings:
In [174]: %timeit (df['Test Date'] - df['First Use Date']).abs().dt.days
10 loops, best of 3: 38.8 ms per loop
In [175]: %timeit (df['Test Date'] - df['First Use Date']).abs() / np.timedelta64(1, 'D')
1000 loops, best of 3: 1.62 ms per loop

Calculate time in certain state for time series data

I have an irregularly indexed time series of data with seconds resolution like:
import pandas as pd
idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43',
'2012-09-26 18:35:11', '2012-11-11 2:34:59']
status = [1, 0, 1, 0]
df = pd.DataFrame(status, index=idx, columns = ['status'])
df.index = pd.to_datetime(df.index)
In [62]: df
Out[62]:
status
2012-01-01 12:43:35 1
2012-03-12 15:46:43 0
2012-09-26 18:35:11 1
2012-11-11 02:34:59 0
and I am interested in the fraction of the year when the status is 1. The way I currently do it is that I reindex df with every second in the year and use forward filling like:
full_idx = pd.date_range(start = '1/1/2012', end = '12/31/2012', freq='s')
df1 = df.reindex(full_idx, method='ffill')
which returns a DataFrame that contains every second for the year which I can then calculate the mean for, to see the percentage of time in the 1 status like:
In [66]: df1
Out[66]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 31536001 entries, 2012-01-01 00:00:00 to 2012-12-31 00:00:00
Freq: S
Data columns:
status 31490186 non-null values
dtypes: float64(1)
In [67]: df1.status.mean()
Out[67]: 0.31953371123308066
The problem is that I have to do this for a lot of data, and reindexing it for every second in the year is by far the most expensive operation.
What are better ways to do this?
There doesn't seem to be a pandas method to compute time differences between entries of an irregular time series, though there is a convenience method to convert a time series index to an array of datetime.datetime objects, which can be converted to datetime.timedelta objects through subtraction.
In [6]: start_end = pd.DataFrame({'status': [0, 0]},
index=[pd.Timestamp('2012-01-01'),
pd.Timestamp('2012-12-31')])
In [7]: df = pd.concat([df, start_end]).sort_index()
In [8]: df
Out[8]:
status
2012-01-01 00:00:00 0
2012-01-01 12:43:35 1
2012-03-12 15:46:43 0
2012-09-26 18:35:11 1
2012-11-11 02:34:59 0
2012-12-31 00:00:00 0
In [9]: pydatetime = pd.Series(df.index.to_pydatetime(), index=df.index)
In [11]: df['duration'] = pydatetime.diff().shift(-1).\
map(datetime.timedelta.total_seconds, na_action='ignore')
In [16]: df
Out[16]:
status duration
2012-01-01 00:00:00 0 45815
2012-01-01 12:43:35 1 6145388
2012-03-12 15:46:43 0 17117308
2012-09-26 18:35:11 1 3916788
2012-11-11 02:34:59 0 4310701
2012-12-31 00:00:00 0 NaN
In [17]: (df.status * df.duration).sum() / df.duration.sum()
Out[17]: 0.31906950786402843
Note:
Our answers seem to differ because I set status before the first timestamp to zero, while those entries are NA in your df1 as there's no start value to forward fill and NA values are excluded by pandas mean().
timedelta.total_seconds() is new in Python 2.7.
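With a current pandas release, the same duration-weighted calculation can be sketched without going through datetime.datetime objects, since differences of a DatetimeIndex are already timedeltas. The status before the first timestamp is assumed to be 0, as in the answer above.

```python
import pandas as pd

idx = pd.to_datetime(['2012-01-01 12:43:35', '2012-03-12 15:46:43',
                      '2012-09-26 18:35:11', '2012-11-11 02:34:59'])
df = pd.DataFrame({'status': [1, 0, 1, 0]}, index=idx)

# Add explicit start and end rows (status assumed 0 at both).
bounds = pd.DataFrame({'status': [0, 0]},
                      index=pd.to_datetime(['2012-01-01', '2012-12-31']))
df = pd.concat([df, bounds]).sort_index()

# Seconds until the next entry; the last row has no successor and gets NaN.
df['duration'] = df.index.to_series().diff().shift(-1).dt.total_seconds()

# Duration-weighted share of time spent in status 1 (NaN is skipped by sum).
fraction = (df['status'] * df['duration']).sum() / df['duration'].sum()
print(fraction)
```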
Timing comparison of this method versus reindexing:
In [8]: timeit delta_method(df)
1000 loops, best of 3: 1.3 ms per loop
In [9]: timeit redindexing(df)
1 loops, best of 3: 2.78 s per loop
Another potential approach is to use traces.
import traces
from dateutil.parser import parse as date_parse
idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43',
'2012-09-26 18:35:11', '2012-11-11 2:34:59']
status = [1, 0, 1, 0]
# create a TimeSeries from date strings and status
ts = traces.TimeSeries(default=0)
for date_string, status_value in zip(idx, status):
    ts[date_parse(date_string)] = status_value
# compute distribution
ts.distribution(
start=date_parse('2012-01-01'),
end=date_parse('2013-01-01'),
)
# {0: 0.6818022667476219, 1: 0.31819773325237805}
The value is calculated between the start of January 1, 2012 and end of December 31, 2012 (equivalently the start of January 1, 2013) without resampling, and assuming the status is 0 at the start of the year (the default=0 parameter)
Timing results:
In [2]: timeit ts.distribution(
start=date_parse('2012-01-01'),
end=date_parse('2013-01-01')
)
619 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
