filter data frame by specific dates - python

I have the data frame below, with dates ranging from 2016-01-01 to 2021-03-27:
timestamp close circulating_supply issuance_native
0 2016-01-01 0.944695 7.389026e+07 26070.31250
1 2016-01-02 0.931646 7.391764e+07 27383.90625
2 2016-01-03 0.962863 7.394532e+07 27675.78125
3 2016-01-04 0.944515 7.397274e+07 27420.62500
4 2016-01-05 0.950312 7.400058e+07 27839.21875
I'm looking to filter this dataframe by Month & Day to look at the circulating supply on December 31st for each year.
Here is the output of the data types of the data frame:
timestamp datetime64[ns]
close float64
circulating_supply float64
issuance_native float64
dtype: object
I'm able to pull single rows using this:
ts = pd.to_datetime('2016-12-31')
df.loc[df['timestamp'] == ts]
but I've had no luck passing a list of datetimes inside df.loc[].
The result should look like this, showing the rows for December 31st of each year:
timestamp close circulating_supply issuance_native
0 2016-12-31 0.944695 7.389026e+07 26070.31250
1 2017-12-31 0.931646 7.391764e+07 27383.90625
2 2018-12-31 0.962863 7.394532e+07 27675.78125
3 2019-12-31 0.944515 7.397274e+07 27420.62500
4 2020-12-31 0.950312 7.400058e+07 27839.21875
This is the closest I've gotten, but I get this warning:
#query dataframe for the circulating supply at the end of the year
circulating_supply = df.query("timestamp == '2016-12-31' or timestamp =='2017-12-31' or timestamp =='2018-12-31' or timestamp =='2019-12-31' or timestamp =='2020-12-31' or timestamp =='2021-03-01'")
circulating_supply.drop(columns=['close', 'issuance_native'], inplace=True)
circulating_supply.copy()
circulating_supply.head()
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py:4308: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().drop(

Try something like this:
# build the list of target dates once, as Timestamps
end_of_year = pd.to_datetime([
    "2016-12-31",
    "2017-12-31",
    "2018-12-31",
    "2019-12-31",
    "2020-12-31",
    "2021-03-01",
])
end_of_year_df = df.loc[df["timestamp"].isin(end_of_year), :]
circulating_supply = end_of_year_df.drop(columns=["close", "issuance_native"])
circulating_supply.head()
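Alternatively, since you asked about filtering by month and day, you can skip the explicit date list and filter on the date components directly. A minimal sketch using the .dt accessor, assuming df['timestamp'] is datetime64 as your dtypes show:
# keep rows whose timestamp falls on December 31st of any year
eoy = df.loc[(df['timestamp'].dt.month == 12) & (df['timestamp'].dt.day == 31)]
circulating_supply = eoy[['timestamp', 'circulating_supply']]
circulating_supply.head()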

I was able to solve this by ignoring the warning I got when using the .drop() function on my df.query() result:
#query dataframe for the circulating supply at the end of the year
circulating_supply = df.query("timestamp == '2016-12-31' or timestamp =='2017-12-31' or timestamp =='2018-12-31' or timestamp =='2019-12-31' or timestamp =='2020-12-31' or timestamp =='2021-03-01'")
circulating_supply.drop(columns=['close', 'issuance_native'], inplace=True)
circulating_supply.copy() # note: this line is a no-op, since the returned copy is never assigned
circulating_supply.head()
#add the column
yearly_issuance['EOY Supply'] = circulating_supply['circulating_supply'].values
yearly_issuance.head()
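As an aside, the SettingWithCopyWarning can be avoided entirely by copying before dropping rather than after. A minimal sketch, reusing the end_of_year list from the answer above:
# .copy() first, so drop() operates on an independent DataFrame, not a view of df
circulating_supply = df.loc[df['timestamp'].isin(end_of_year)].copy()
circulating_supply.drop(columns=['close', 'issuance_native'], inplace=True)
circulating_supply.head()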

Related

How can I filter for rows one hour before and after a set timestamp in Python?

I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but contains only the rows from 1 hour before to 1 hour after a specified timestamp, i.e. only the rows within this 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter where I get rid of all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
date_search = dt.datetime.strptime("2011-01-15 05:20:00", '%Y-%m-%d %H:%M:%S')
# keep rows strictly after (search - 1h) and up to (search + 1h)
mask = (df['date'] > date_search - dt.timedelta(hours=1)) & (df['date'] <= date_search + dt.timedelta(hours=1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
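For completeness, the index-slicing idea from the original attempt also works once the syntax is fixed; .loc with a slice (not parentheses) is the way to window a sorted DatetimeIndex. A minimal sketch:
hour = pd.Timedelta(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
# sort the index so label-based range slicing is well defined
window = df.set_index("timestamp").sort_index().loc[date - hour : date + hour]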

Plotting datetime index

I am running into an issue with a dataframe grouped by date:
byDate = df.groupby('Date').count()
Date Value
2019-08-15 2
2019-08-19 1
2019-08-23 7
2019-08-28 4
2019-09-04 7
2019-09-09 2
I know that type(df["Date"].iloc[0])
returns datetime.date
I want to plot the data in such a way that days for which no value is available are shown as 0.
I have played around with
ax = sns.lineplot(x=byDate.index.fillna(0), y="Value", data=byDate)
I am however only able to get this output, where the y-axis indicates that a line is not drawn to 0 for days for which no value is available.
Have you ever tried creating a new DataFrame object indexed by all the dates ranging from startDate to endDate, and then filling in the missing values with 0.0?
The code would look something like:
dates = pd.to_datetime(['2019-08-15','2019-08-19','2019-08-23','2019-08-28','2019-09-04','2019-09-09']).date
byDate = pd.DataFrame({'Value':[2,1,7,4,7,2]},index=dates)
startDate = byDate.index.min()
endDate = byDate.index.max()
newDates = pd.date_range(startDate, endDate, freq='D').date.tolist()
newDatesDf = pd.DataFrame(index=newDates)
newByDate = pd.concat([newDatesDf, byDate], axis=1).fillna(0)
sns.lineplot(x=newByDate.index, y="Value", data=newByDate)
(output: line plot where days with no value are now drawn at 0)
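A shorter route to the same zero-filled series is reindex, which avoids the concat entirely. A minimal sketch, assuming byDate is indexed by dates as above:
# reindex onto a complete daily range, filling the missing days with 0
full_range = pd.date_range(byDate.index.min(), byDate.index.max(), freq='D').date
filled = byDate.reindex(full_range, fill_value=0)
sns.lineplot(x=filled.index, y='Value', data=filled)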

Pandas Manipulating Freq for Business Day DateRange

I am trying to add a set of common date-related columns to my data frame, and my approach to building these date columns is based on the .date_range() pandas method that holds the date range for my dataframe.
While I can use methods like .index.day or .index.weekday_name for general date columns, I would like to set a business-day column based on the date_range I constructed, but I am not sure whether I can use the freq attribute nickname 'B' or whether I need to create a new date range.
Further, I am hoping to not count those business days based on a list of holiday dates that I have.
Here is my setup:
Holiday table
holiday_table = holiday_table.set_index('date')
holiday_table_dates = holiday_table.index.to_list() # ['2019-12-31', etc..]
Base Date Table
data_date_range = pd.date_range(start=date_range_start, end=date_range_end)
df = pd.DataFrame({'date': data_date_range}).set_index('date')
df['day_index'] = df.index.day
# Weekday Name
df['weekday_name'] = df.index.weekday_name
# Business day
df['business_day'] = data_date_range.freq("B")
Error at df['business_day'] = data_date_range.freq("B"):
---> 13 df['business_day'] = data_date_range.freq("B")
ApplyTypeError: Unhandled type: str
OK, I think I understand your question now. You are looking to create a new column of working business days (excluding your custom holidays). In my example I just used the regular US holidays from pandas, but you already have your holidays as a list in holiday_table_dates, so you should still be able to follow the general layout of my example for your specific use. I also assumed that you are OK with boolean values for your business_day column:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as h_cal
# sample data
data_date_range = pd.date_range(start='1/1/2019', end='12/31/2019')
df = pd.DataFrame({'date': data_date_range}).set_index('date')
df['day_index'] = df.index.day
# Weekday Name (day_name() replaces the removed weekday_name attribute)
df['weekday_name'] = df.index.day_name()
# this is just a sample using US holidays
hday = h_cal().holidays(df.index.min(), df.index.max())
# b is the same date range as above, just with freq set to business days
b = pd.date_range(start='1/1/2019', end='12/31/2019', freq='B')
# find all the working business day where b is not a holiday
bday = b[~b.isin(hday)]
# create a boolean col where the date index is in your custom business day we just created
df['bday'] = df.index.isin(bday)
day_index weekday_name bday
date
2019-01-01 1 Tuesday False
2019-01-02 2 Wednesday True
2019-01-03 3 Thursday True
2019-01-04 4 Friday True
2019-01-05 5 Saturday False
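If you would rather bake the holidays into the frequency itself, pandas also provides a CustomBusinessDay offset. A minimal sketch, assuming holiday_table_dates is your own holiday list from the question:
from pandas.tseries.offsets import CustomBusinessDay
# CustomBusinessDay skips weekends plus any dates passed via holidays=
cbd = CustomBusinessDay(holidays=holiday_table_dates)
business_days = pd.date_range(start='1/1/2019', end='12/31/2019', freq=cbd)
df['bday'] = df.index.isin(business_days)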

pandas change column type to datetime after group by

This is related to a previous question which I asked here (pandas average by timestamp and day of the week).
Here, I perform a groupby operation as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random(2838), index=pd.date_range('2019-09-13 12:40:00', periods=2838, freq='5T'))
grouped = df.groupby(df.index.strftime('%A %H:%M')).mean()
# Reset the index
grouped.reset_index(inplace=True)
Now if I check the data types of the grouped result, we have:
index object
0 float64
The column does not retain its datetime data type. How can I still preserve the column data type?
I wouldn't do the grouping like that; instead, I would group on two keys:
days = df.index.day_name()
times = df.index.time
df.groupby([days,times]).mean()
which gives (head):
0
Friday 00:00:00 0.524322
00:05:00 0.857684
00:10:00 0.593461
00:15:00 0.755158
00:20:00 0.049511
where the first-level index holds the (string) day names, and the second-level index holds datetime.time objects.
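One benefit of keeping real time objects in the second level is that you can look rows up by time rather than by formatted string. A small usage sketch, assuming the grouped result is saved as result:
import datetime as dt
result = df.groupby([days, times]).mean()
# select Friday 00:05 with an actual datetime.time, not a string
print(result.loc[('Friday', dt.time(0, 5))])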

How to get the number of business days between two dates in pandas

I have the following column in a dataframe. I would like to add a column to the end of this dataframe that holds the number of business days from each date up to today (6/24).
The BDay() function does not seem to have this capability.
Date
2019-6-21
2019-6-20
2019-6-14
I am looking for a result that looks like following:
Date Business days
2019-6-21 1
2019-6-20 2
2019-6-14 6
Is there an easy way to do this, other than doing individual manipulations or using the datetime library?
Use np.busday_count:
import numpy as np
# df['Date'] = pd.to_datetime(df['Date'])  # if needed
np.busday_count(df['Date'].dt.date, np.datetime64('today'))
# array([1, 2, 6])
df['bdays'] = np.busday_count(df['Date'].dt.date, np.datetime64('today'))
df
Date bdays
0 2019-06-21 1
1 2019-06-20 2
2 2019-06-14 6
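np.busday_count also accepts a holidays argument, so specific non-working dates can be excluded from the count. A minimal sketch with a hypothetical holiday date:
# exclude the listed dates (hypothetical holiday) from the business-day count
holidays = ['2019-06-17']
df['bdays'] = np.busday_count(df['Date'].dt.date, np.datetime64('today'), holidays=holidays)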
