Append values in pandas where value equals other value - python

I have two data frames:
dfi = pd.read_csv('C:/Users/Mauricio/Desktop/inflation.csv')
dfm = pd.read_csv('C:/Users/Mauricio/Desktop/maturity.csv')
# equals the following
observation_date CPIAUCSL
0 1947-01-01 21.48
1 1947-02-01 21.62
2 1947-03-01 22.00
3 1947-04-01 22.00
4 1947-05-01 21.95
observation_date DGS10
0 1962-01-02 4.06
1 1962-01-03 4.03
2 1962-01-04 3.99
3 1962-01-05 4.02
4 1962-01-08 4.03
I created a copy as df doing the following:
df = dfi.copy(deep=True)
which returns an exact copy of dfi, dfi dates go by month and dfm dates go by day, I want to create a new column in df that everytime a date in dfi == a date in dfm, to append the DGS10 value in it.
I have this so far:
for date in df.observation_date:
for date2 in dfm.observation_date:
if date==date2:
df['mat_rate'] = dfm['DGS10']
# this is what I get but dates do not match values
observation_date CPIAUCSL mat_rate
0 1947-01-01 21.48 4.06
1 1947-02-01 21.62 4.03
2 1947-03-01 22.00 3.99
3 1947-04-01 22.00 4.02
4 1947-05-01 21.95 4.03
It works but does not append the dates where date == date2 what can I do so it appends the values where date equals date2 only?
Thank you!

If the date formats are inconsistent, convert them first:
dfi.observation_date = pd.to_datetime(dfi.observation_date, format='%Y-%m-%d')
dfm.observation_date = pd.to_datetime(dfm.observation_date, format='%Y-%m-%d')
Now, getting your result should be easy with a merge:
df = dfi.merge(dfm, on='observation_date')

Related

Plot scatter for range of dates in matplotlib [duplicate]

I am creating a DataFrame from a csv as follows:
stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)
The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?
There are two possible solutions:
Use a boolean mask, then use df.loc[mask]
Set the date column as a DatetimeIndex, then use df[start_date : end_date]
Using a boolean mask:
Ensure df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetimes,
np.datetime64s, pd.Timestamps, or even datetime strings:
#greater than the start date and smaller than the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
or re-assign to df
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
Using a DatetimeIndex:
If you are going to do a lot of selections by date, it may be quicker to set the
date column as the index first. Then you can select rows by date using
df.loc[start_date:end_date].
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
While Python list indexing, e.g. seq[start:end] includes start but not end, in contrast, Pandas df.loc[start_date : end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index however.
Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
I feel the best option will be to use the direct checks rather than using loc function:
df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]
It works for me.
Major issue with loc function with a slice is that the limits should be present in the actual values, if not this will result in KeyError.
You can also use between:
df[df.some_date.between(start_date, end_date)]
You can use the isin method on the date column like so
df[df["date"].isin(pd.date_range(start_date, end_date))]
Note: This only works with dates (as the question asks) and not timestamps.
Example:
import numpy as np
import pandas as pd
# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]
print(in_range_df) # print result
which gives
0 1 2 date
14 0.960974 0.144271 0.839593 2017-01-15
15 0.814376 0.723757 0.047840 2017-01-16
16 0.911854 0.123130 0.120995 2017-01-17
17 0.505804 0.416935 0.928514 2017-01-18
18 0.204869 0.708258 0.170792 2017-01-19
19 0.014389 0.214510 0.045201 2017-01-20
Keeping the solution simple and pythonic, I would suggest you to try this.
In case if you are going to do this frequently the best solution would be to first set the date column as index which will convert the column in DateTimeIndex and use the following condition to slice any range of dates.
import pandas as pd
data_frame = data_frame.set_index('date')
df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
pandas 0.22 has a between() function.
Makes answering this question easier and more readable code.
# create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})
Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:
# use the between statement to get a boolean mask
df['dates'].between('2018-11-27','2019-01-15', inclusive=False)
0 False
1 False
2 False
3 False
4 False
# you can pass this boolean mask straight to loc
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]
dates
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
335 2018-12-02
Notice the inclusive argument. very helpful when you want to be explicit about your range. notice when set to True we return Nov 27th of 2018 as well:
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
dates
330 2018-11-27
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
This method is also faster than the previously mentioned isin method:
%%timeit -n 5
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
%%timeit -n 5
df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
However, it is not faster than the currently accepted answer, provided by unutbu, only if the mask is already created. but if the mask is dynamic and needs to be reassigned over and over, my method may be more efficient:
# already create the mask THEN time the function
start_date = dt.datetime(2018,11,27)
end_date = dt.datetime(2019,1,15)
mask = (df['dates'] > start_date) & (df['dates'] <= end_date)
%%timeit -n 5
df.loc[mask]
191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
Another option, how to achieve this, is by using pandas.DataFrame.query() method. Let me show you an example on the following data frame called df.
>>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
>>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
>>> print(df)
col_1 date
0 0.015198 2020-01-01
1 0.638600 2020-01-02
2 0.348485 2020-01-03
3 0.247583 2020-01-04
4 0.581835 2020-01-05
As an argument, use the condition for filtering like this:
>>> start_date, end_date = '2020-01-02', '2020-01-04'
>>> print(df.query('date >= #start_date and date <= #end_date'))
col_1 date
1 0.244104 2020-01-02
2 0.374775 2020-01-03
3 0.510053 2020-01-04
If you do not want to include boundaries, just change the condition like following:
>>> print(df.query('date > #start_date and date < #end_date'))
col_1 date
2 0.374775 2020-01-03
You can use the method truncate:
dates = pd.date_range('2016-01-01', '2016-01-06', freq='d')
df = pd.DataFrame(index=dates, data={'A': 1})
A
2016-01-01 1
2016-01-02 1
2016-01-03 1
2016-01-04 1
2016-01-05 1
2016-01-06 1
Select data between two dates:
df.truncate(before=pd.Timestamp('2016-01-02'),
after=pd.Timestamp('2016-01-4'))
Output:
A
2016-01-02 1
2016-01-03 1
2016-01-04 1
It is highly recommended to convert a date column to an index. Doing that will give a lot of facilities. One is to select the rows between two dates easily, you can see this example:
import numpy as np
import pandas as pd
# Dataframe with monthly data between 2016 - 2020
df = pd.DataFrame(np.random.random((60, 3)))
df['date'] = pd.date_range('2016-1-1', periods=60, freq='M')
To select the rows between 2017-01-01 and 2019-01-01, you need only to convert the date column to an index:
df.set_index('date', inplace=True)
and then only slicing:
df.loc['2017':'2019']
You can select the date column as index while reading the csv file directly instead of the df.set_index():
df = pd.read_csv('file_name.csv',index_col='date')
I prefer not to alter the df.
An option is to retrieve the index of the start and end dates:
import numpy as np
import pandas as pd
#Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
#Get the index of the start and end dates respectively
start = df[df['date']=='2017-01-07'].index[0]
end = df[df['date']=='2017-01-14'].index[0]
#Show the sliced df (from 2017-01-07 to 2017-01-14)
df.loc[start:end]
which results in:
0 1 2 date
6 0.5 0.8 0.8 2017-01-07
7 0.0 0.7 0.3 2017-01-08
8 0.8 0.9 0.0 2017-01-09
9 0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14
Inspired by unutbu
print(df.dtypes) #Make sure the format is 'object'. Rerunning this after index will not show values.
columnName = 'YourColumnName'
df[columnName+'index'] = df[columnName] #Create a new column for index
df.set_index(columnName+'index', inplace=True) #To build index on the timestamp/dates
df.loc['2020-09-03 01:00':'2020-09-06'] #Select range from the index. This is your new Dataframe.
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000],
'Duration':['30days','50days','55days','40days','60days','35days','55days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400],
'InsertedDates':["2021-11-14","2021-11-15","2021-11-16","2021-11-17","2021-11-18","2021-11-19","2021-11-20"]
})
df = pd.DataFrame(technologies)
print(df)
Using pandas.DataFrame.loc to Filter Rows by Dates
Method 1:
mask = (df['InsertedDates'] > start_date) & (df['InsertedDates'] <= end_date)
df2 = df.loc[mask]
print(df2)
Method 2:
start_date = '2021-11-15'
end_date = '2021-11-19'
after_start_date = df["InsertedDates"] >= start_date
before_end_date = df["InsertedDates"] <= end_date
between_two_dates = after_start_date & before_end_date
df2 = df.loc[between_two_dates]
print(df2)
Using pandas.DataFrame.query() to select DataFrame Rows
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates >= #start_date and InsertedDates <= #end_date')
print(df2)
Select rows between two dates using DataFrame.query()
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates > #start_date and InsertedDates < #end_date')
print(df2)
pandas.Series.between() function Using two dates
df2 = df.loc[df["InsertedDates"].between("2021-11-16", "2021-11-18")]
print(df2)
Select DataFrame rows between two dates using DataFrame.isin()
df2 = df[df["InsertedDates"].isin(pd.date_range("2021-11-15", "2021-11-17"))]
print(df2)
you can do it with pd.date_range() and Timestamp.
Let's say you have read a csv file with a date column using parse_dates option:
df = pd.read_csv('my_file.csv', parse_dates=['my_date_col'])
Then you can define a date range index :
rge = pd.date_range(end='15/6/2020', periods=2)
and then filter your values by date thanks to a map:
df.loc[df['my_date_col'].map(lambda row: row.date() in rge)]

Filter data by Year and month range [duplicate]

I am creating a DataFrame from a csv as follows:
stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)
The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?
There are two possible solutions:
Use a boolean mask, then use df.loc[mask]
Set the date column as a DatetimeIndex, then use df[start_date : end_date]
Using a boolean mask:
Ensure df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetimes,
np.datetime64s, pd.Timestamps, or even datetime strings:
#greater than the start date and smaller than the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
or re-assign to df
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
Using a DatetimeIndex:
If you are going to do a lot of selections by date, it may be quicker to set the
date column as the index first. Then you can select rows by date using
df.loc[start_date:end_date].
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
While Python list indexing, e.g. seq[start:end] includes start but not end, in contrast, Pandas df.loc[start_date : end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index however.
Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
I feel the best option will be to use the direct checks rather than using loc function:
df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]
It works for me.
Major issue with loc function with a slice is that the limits should be present in the actual values, if not this will result in KeyError.
You can also use between:
df[df.some_date.between(start_date, end_date)]
You can use the isin method on the date column like so
df[df["date"].isin(pd.date_range(start_date, end_date))]
Note: This only works with dates (as the question asks) and not timestamps.
Example:
import numpy as np
import pandas as pd
# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]
print(in_range_df) # print result
which gives
0 1 2 date
14 0.960974 0.144271 0.839593 2017-01-15
15 0.814376 0.723757 0.047840 2017-01-16
16 0.911854 0.123130 0.120995 2017-01-17
17 0.505804 0.416935 0.928514 2017-01-18
18 0.204869 0.708258 0.170792 2017-01-19
19 0.014389 0.214510 0.045201 2017-01-20
Keeping the solution simple and pythonic, I would suggest you to try this.
In case if you are going to do this frequently the best solution would be to first set the date column as index which will convert the column in DateTimeIndex and use the following condition to slice any range of dates.
import pandas as pd
data_frame = data_frame.set_index('date')
df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
pandas 0.22 has a between() function.
Makes answering this question easier and more readable code.
# create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})
Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:
# use the between statement to get a boolean mask
df['dates'].between('2018-11-27','2019-01-15', inclusive=False)
0 False
1 False
2 False
3 False
4 False
# you can pass this boolean mask straight to loc
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]
dates
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
335 2018-12-02
Notice the inclusive argument. very helpful when you want to be explicit about your range. notice when set to True we return Nov 27th of 2018 as well:
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
dates
330 2018-11-27
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
This method is also faster than the previously mentioned isin method:
%%timeit -n 5
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
%%timeit -n 5
df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
However, it is not faster than the currently accepted answer, provided by unutbu, only if the mask is already created. but if the mask is dynamic and needs to be reassigned over and over, my method may be more efficient:
# already create the mask THEN time the function
start_date = dt.datetime(2018,11,27)
end_date = dt.datetime(2019,1,15)
mask = (df['dates'] > start_date) & (df['dates'] <= end_date)
%%timeit -n 5
df.loc[mask]
191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
Another option, how to achieve this, is by using pandas.DataFrame.query() method. Let me show you an example on the following data frame called df.
>>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
>>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
>>> print(df)
col_1 date
0 0.015198 2020-01-01
1 0.638600 2020-01-02
2 0.348485 2020-01-03
3 0.247583 2020-01-04
4 0.581835 2020-01-05
As an argument, use the condition for filtering like this:
>>> start_date, end_date = '2020-01-02', '2020-01-04'
>>> print(df.query('date >= #start_date and date <= #end_date'))
col_1 date
1 0.244104 2020-01-02
2 0.374775 2020-01-03
3 0.510053 2020-01-04
If you do not want to include boundaries, just change the condition like following:
>>> print(df.query('date > #start_date and date < #end_date'))
col_1 date
2 0.374775 2020-01-03
You can use the method truncate:
dates = pd.date_range('2016-01-01', '2016-01-06', freq='d')
df = pd.DataFrame(index=dates, data={'A': 1})
A
2016-01-01 1
2016-01-02 1
2016-01-03 1
2016-01-04 1
2016-01-05 1
2016-01-06 1
Select data between two dates:
df.truncate(before=pd.Timestamp('2016-01-02'),
after=pd.Timestamp('2016-01-4'))
Output:
A
2016-01-02 1
2016-01-03 1
2016-01-04 1
It is highly recommended to convert a date column to an index. Doing that will give a lot of facilities. One is to select the rows between two dates easily, you can see this example:
import numpy as np
import pandas as pd
# Dataframe with monthly data between 2016 - 2020
df = pd.DataFrame(np.random.random((60, 3)))
df['date'] = pd.date_range('2016-1-1', periods=60, freq='M')
To select the rows between 2017-01-01 and 2019-01-01, you need only to convert the date column to an index:
df.set_index('date', inplace=True)
and then only slicing:
df.loc['2017':'2019']
You can select the date column as index while reading the csv file directly instead of the df.set_index():
df = pd.read_csv('file_name.csv',index_col='date')
I prefer not to alter the df.
An option is to retrieve the index of the start and end dates:
import numpy as np
import pandas as pd
#Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
#Get the index of the start and end dates respectively
start = df[df['date']=='2017-01-07'].index[0]
end = df[df['date']=='2017-01-14'].index[0]
#Show the sliced df (from 2017-01-07 to 2017-01-14)
df.loc[start:end]
which results in:
0 1 2 date
6 0.5 0.8 0.8 2017-01-07
7 0.0 0.7 0.3 2017-01-08
8 0.8 0.9 0.0 2017-01-09
9 0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14
Inspired by unutbu
print(df.dtypes) #Make sure the format is 'object'. Rerunning this after index will not show values.
columnName = 'YourColumnName'
df[columnName+'index'] = df[columnName] #Create a new column for index
df.set_index(columnName+'index', inplace=True) #To build index on the timestamp/dates
df.loc['2020-09-03 01:00':'2020-09-06'] #Select range from the index. This is your new Dataframe.
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000],
'Duration':['30days','50days','55days','40days','60days','35days','55days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400],
'InsertedDates':["2021-11-14","2021-11-15","2021-11-16","2021-11-17","2021-11-18","2021-11-19","2021-11-20"]
})
df = pd.DataFrame(technologies)
print(df)
Using pandas.DataFrame.loc to Filter Rows by Dates
Method 1:
mask = (df['InsertedDates'] > start_date) & (df['InsertedDates'] <= end_date)
df2 = df.loc[mask]
print(df2)
Method 2:
start_date = '2021-11-15'
end_date = '2021-11-19'
after_start_date = df["InsertedDates"] >= start_date
before_end_date = df["InsertedDates"] <= end_date
between_two_dates = after_start_date & before_end_date
df2 = df.loc[between_two_dates]
print(df2)
Using pandas.DataFrame.query() to select DataFrame Rows
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates >= #start_date and InsertedDates <= #end_date')
print(df2)
Select rows between two dates using DataFrame.query()
start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates > #start_date and InsertedDates < #end_date')
print(df2)
pandas.Series.between() function Using two dates
df2 = df.loc[df["InsertedDates"].between("2021-11-16", "2021-11-18")]
print(df2)
Select DataFrame rows between two dates using DataFrame.isin()
df2 = df[df["InsertedDates"].isin(pd.date_range("2021-11-15", "2021-11-17"))]
print(df2)
you can do it with pd.date_range() and Timestamp.
Let's say you have read a csv file with a date column using parse_dates option:
df = pd.read_csv('my_file.csv', parse_dates=['my_date_col'])
Then you can define a date range index :
rge = pd.date_range(end='15/6/2020', periods=2)
and then filter your values by date thanks to a map:
df.loc[df['my_date_col'].map(lambda row: row.date() in rge)]

Not able to convert dataframe index from datetime format

download link1I have extracted raw data from the a csv file and I have set the index column to Date.Here is in the attached clip
The index is not in datetime format and when I try to convert using the below code
df.index=pd.to_datetime(df.index)
I get this error:
"ValueError: month must be in 1..12"
The current dtype for the index is 'object'
I have seen some previous questions related to conversion to datetime but am afraid I couldn't use that to get a solution to my question. Could someone help please?
thanks,
There is problem 3 different types of datetimes - solution is parse each separately - for unmatched values are created NaNs, so for replace them use Series.combine_first:
df = pd.read_csv('FFdata1.csv', index_col=['Date'])
df = df.reset_index()
#format YYDDMM
d1 = pd.to_datetime(df['Date'], format='%y%d%m', errors='coerce')
#format YYYY
d2 = pd.to_datetime(df['Date'], format='%Y', errors='coerce')
#format YYYYMM
d3 = pd.to_datetime(df['Date'], format='%Y%m', errors='coerce')
df['Date'] = d1.combine_first(d2).combine_first(d3)
#check not parsed datetimes
print(df[df['Date'].isna()])
Date Mkt-RF SMB HML RF
1113 NaT NaN NaN NaN NaN
1114 NaT NaN NaN NaN NaN
1115 NaT Mkt-RF SMB HML RF
1208 NaT NaN NaN NaN NaN
1209 NaT NaN NaN NaN NaN
Another possible solution is create 3 separate DataFrames:
df = pd.read_csv('FFdata1.csv', index_col=['Date'])
df = df.reset_index()
#format YYDDMM
d1 = pd.to_datetime(df['Date'], format='%y%d%m', errors='coerce')
df1 = df.assign(Date=d1).dropna(subset=['Date'])
print (df1.head())
Date Mkt-RF SMB HML RF
0 2019-07-26 2.96 -2.3 -2.87 0.22
1 2019-08-26 2.64 -1.4 4.19 0.25
2 2019-09-26 0.36 -1.32 0.01 0.23
3 2019-10-26 -3.24 0.04 0.51 0.32
4 2019-11-26 2.53 -0.2 -0.35 0.31
#format YYYY
d2 = pd.to_datetime(df['Date'], format='%Y', errors='coerce')
df2 = df.assign(Date=d2).dropna(subset=['Date'])
print (df2.head())
Date Mkt-RF SMB HML RF
1116 1927-01-01 29.47 -2.46 -3.75 3.12
1117 1928-01-01 35.39 4.2 -6.15 3.56
1118 1929-01-01 -19.54 -30.8 11.81 4.75
1119 1930-01-01 -31.23 -5.13 -12.28 2.41
1120 1931-01-01 -45.11 3.53 -14.29 1.07
#format YYYYMM
d3 = pd.to_datetime(df['Date'], format='%Y%m', errors='coerce')
df3 = df.assign(Date=d3).dropna(subset=['Date'])
print (df3.head())
Date Mkt-RF SMB HML RF
0 1926-07-01 2.96 -2.3 -2.87 0.22
1 1926-08-01 2.64 -1.4 4.19 0.25
2 1926-09-01 0.36 -1.32 0.01 0.23
3 1926-10-01 -3.24 0.04 0.51 0.32
4 1926-11-01 2.53 -0.2 -0.35 0.31
The file contains more than one data series. The beginning of the file has a header line and then dates formatted as %Y%m. But at line 1115, we find a line containing only empty values, followed with a textual information (Annual Factors: January-December), a new header line and than annual data with date formatted as %Y only. This is far beyond what read_csv can automagically process.
So my advice is to first load the file without trying to parse the Date column, then reject any line past the first one containing an empty date, and only then parse the date on the remaining lines.
Code could be:
df = pd.read_csv('FFdata1.csv').loc[df.index < df[df.Date.isna()].index[0]]
df['Date'] = pd.to_datetime(df.Date,format='%Y%m')
df.set_index('Date', inplace=True)

Extract and transform Data from Dataset

I have a Dataset that is in following format:
Time-Stamp (dd-mm-yyyy) Temperature
I need to extract Day and Month from the Time-Stamp information from each observation in the series
Current Dataset format:
0 1981-01-01 20.7
1 1981-01-02 17.9
2 1981-01-03 18.8
Desired Dataset format:
month day temperature
1 1 20.7
1 2 17.9
1 3 18.8
1 4 14.6
import the data into pandas dataframe, you'll have 3 columns, then the date column split it as:
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
where df is your dataframe and you will have your split data

Python - Select min values in dataframe

I have a data frame that looks like this:
How can I make a new data frame that contains only the minimum 'Time' values for a user on the same date?
So I want to have a data frame with the same structure, but only one 'Time' for a 'Date' for a user.
So it should be like this:
Sort values by time column and check for duplicates in Date+User_name. However to make sure 09:00 is lower than 10:00 we can convert the strings to time first.
import pandas as pd
data = {
'User_name':['user1','user1','user1', 'user2'],
'Date':['8/29/2016','8/29/2016', '8/31/2016', '8/31/2016'],
'Time':['9:07:41','9:07:42','9:07:43', '9:31:35']
}
# Recreate sample dataframe
df = pd.DataFrame(data)
Alternative 1 (quicker):
#100 loops, best of 3: 1.73 ms per loop
# Create a mask
m = (df.reindex(pd.to_datetime(df['Time']).sort_values().index)
.duplicated(['Date','User_name']))
# Apply inverted mask
df = df.loc[~m]
Alternative 2 (more readable):
One easier way would be too remake the df['Time'] column to datetime and group it by date and User_name and get the idxmin(). This will be our mask. (Credit to jezrael)
# 100 loops, best of 3: 4.34 ms per loop
# Create a mask
m = pd.to_datetime(df['Time']).groupby([df['Date'],df['User_name']]).idxmin()
df = df.loc[m]
Output:
Date Time User_name
0 8/29/2016 9:07:41 user1
2 8/31/2016 9:07:43 user1
3 8/31/2016 9:31:35 user2
Update 1
#User included into grouping
Not the best way but simple
df = pd.DataFrame(np.datetime64('2016')+
np.random.randint(0,3*24,
size=(7,1)).astype('<m8[h]'),
columns =['DT']).join(pd.Series(list('abcdefg'),name='str_val')
).join(pd.Series(list('UAUAUAU'),name='User'))
df['Date'] = df.DT.dt.date
df['Time'] = df.DT.dt.time
df.drop(columns = ['DT'],inplace=True)
print (df)
Output:
str_val User Date Time
0 a U 2016-01-01 04:00:00
1 b A 2016-01-01 10:00:00
2 c U 2016-01-01 20:00:00
3 d A 2016-01-01 22:00:00
4 e U 2016-01-02 04:00:00
5 f A 2016-01-02 23:00:00
6 g U 2016-01-02 09:00:00
Code to get values
print (df.sort_values(['Date','User','Time']).groupby(['Date','User']).first())
Output:
Date User
2016-01-01 A b 10:00:00
U a 04:00:00
2016-01-02 A f 23:00:00
U e 04:00:00

Categories

Resources