extracting the last day of the quarter from dataframes - python

I have a daily time series and I want to pick out the data for the last day of the quarter. I tried doing this by generating a series for the last day of the quarter and merging it with the other dataframe, but to no avail.
My Python code is here:
import pandas as pd
import numpy as np
s1 = pd.read_csv(r"C:\Users\Tim Peterson\Documents\Tom\Rocky\DJIA.csv", index_col=0,parse_dates=True)
ds1 = pd.DataFrame(s1, columns=['DJIA'])
date1 = "2014-10-10" # input start date
date2 = "2016-01-07" # input end date
month_list = [i.strftime("%b-%y") for i in pd.date_range(start=date1, end=date2, freq='MS')]
ds2 = pd.date_range(date1, date2, freq='BQ')
eom = pd.DataFrame(ds2)
mergedDf = ds1.merge(eom, left_index=True, right_index=True)
print(mergedDf)
when I run this I get
Empty DataFrame
Columns: [DJIA, 0]
Index: []

If I understand correctly, use:
dates = pd.date_range(date1, date2, freq='BQ')
out = ds1[ds1.index.isin(dates)]
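For anyone without the CSV at hand, here is a minimal self-contained sketch of that filter on a mock daily series; the random values are just placeholders for the DJIA column:
import pandas as pd
import numpy as np

date1, date2 = "2014-10-10", "2016-01-07"

# Mock daily series standing in for the DJIA data.
idx = pd.date_range(date1, date2, freq="D")
ds1 = pd.DataFrame({"DJIA": np.random.default_rng(0).normal(17000, 200, len(idx))}, index=idx)

# Keep only the rows falling on a business-quarter end.
dates = pd.date_range(date1, date2, freq="BQ")
out = ds1[ds1.index.isin(dates)]
print(out)  # rows for 2014-12-31, 2015-03-31, 2015-06-30, 2015-09-30, 2015-12-31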

Related

Pandas not appending after first column

I have two dataframes. One contains a column with the date of earnings for a stock. The other contains all the prices for the stock; keep in mind that its index is the date. I want to get the prices of the stock N days before and after earnings and store them column-wise in a new dataframe. This is what I have so far:
earningsPrices = pd.DataFrame()
for date in dates:
    earningsPrices[date] = prices[date - pd.Timedelta(days=N):date + pd.Timedelta(days=N)]
print(earningsPrices)
The problem is that the output only contains the prices for the first date, and not the rest.
You could take this approach:
earningsPrices = pd.DataFrame(index=dates, columns=['price1', 'price2', 'price3'])
for date in dates:
    start_date = date - pd.Timedelta(days=N)
    end_date = date + pd.Timedelta(days=N)
    selected_rows = prices.loc[prices['date_column'].between(start_date, end_date)]
    earningsPrices.loc[date, 'price1'] = selected_rows['price1'].values
    earningsPrices.loc[date, 'price2'] = selected_rows['price2'].values
    earningsPrices.loc[date, 'price3'] = selected_rows['price3'].values
print(earningsPrices)
Use concat:
earningsPrices = pd.DataFrame()
for date in dates:
    earningsPeriod = prices[date - pd.Timedelta(days=window):date + pd.Timedelta(days=window)].reset_index(drop=True)
    earningsPrices = pd.concat([earningsPrices, earningsPeriod], axis=1)
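For reference, here is a minimal self-contained sketch of that concat pattern on mock data; the close column, the concrete earnings dates, and the window size are made up for illustration:
import pandas as pd
import numpy as np

N = 2  # days before/after earnings (placeholder)
prices = pd.DataFrame(
    {"close": np.arange(20.0)},
    index=pd.date_range("2023-01-01", periods=20, freq="D"),
)
dates = pd.to_datetime(["2023-01-05", "2023-01-12"])  # hypothetical earnings dates

earningsPrices = pd.DataFrame()
for date in dates:
    segment = prices.loc[date - pd.Timedelta(days=N):date + pd.Timedelta(days=N), "close"]
    segment = segment.reset_index(drop=True)  # positions 0..2N replace the dates
    segment.name = date.date()                # label each column with its earnings date
    earningsPrices = pd.concat([earningsPrices, segment], axis=1)

print(earningsPrices)  # one column per earnings date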

Change year based on start and end date in dataframe

I had a column in a data frame called startEndDate, for example '10.12-20.05.2019', and I split it into start_date and end_date columns with the same year, e.g. start_date '10.12.2019' and end_date '20.05.2019'. But the year in this example is wrong: it should be 2018, because the start date cannot be after the end date. How can I go through the entire dataframe and replace values so it contains the correct start_dates based on that condition (some start dates should keep the year 2019)?
This will show you the rows in which start_date is greater than end_date:
import pandas as pd
import numpy as np

data = {
    'Start_Date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
    'End_Date': ['2020-02-01', '2019-01-02', '2019-01-03', '2020-01-05']
}
df = pd.DataFrame(data)
df['Start_Date'] = pd.to_datetime(df['Start_Date'], infer_datetime_format=True)
df['End_Date'] = pd.to_datetime(df['End_Date'], infer_datetime_format=True)
df['Check'] = np.where(df['Start_Date'] > df['End_Date'], 'Error', 'No Error')
df
Without seeing more of your data, or your intended final result, this is the best we can do to help identify problems in the data.
This method first splits up the date string to two dates and creates start and end date columns. Then it subtracts 1 year from the start date if it is greater than the end date.
import pandas as pd
import numpy as np

# mock data
df = pd.DataFrame({"dates": ["10.12-20.05.2019", "02.04-31.10.2019"]})

# split date string to two dates, convert to datetime and stack to columns
df[["start", "end"]] = np.vstack(
    df.dates.apply(lambda x: pd.to_datetime(
        [x.split("-")[0] + x[-5:],
         x.split("-")[1]], format="%d.%m.%Y")))

# subtract 1 year from start date if greater than end date
df["start"] = np.where(df["start"] > df["end"],
                       df["start"] - pd.DateOffset(years=1),
                       df["start"])
df
# dates start end
#0 10.12-20.05.2019 2018-12-10 2019-05-20
#1 02.04-31.10.2019 2019-04-02 2019-10-31
Although I have used split here for the initial splitting of the string, it is not really needed: there will always be 5 characters before the hyphen, and the year (with its leading dot) will always be the last 5 characters, so that line could instead be changed to:
df[["start", "end"]] = np.vstack(
df.dates.apply(lambda x: pd.to_datetime(
[x[:5] + x[-5:],
x[6:]], format="%d.%m.%Y")))
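Since the positions in the string are fixed, the same split can also be done without apply, using the vectorized .str accessor; a small sketch under the same assumptions about the dates column:
import pandas as pd

df = pd.DataFrame({"dates": ["10.12-20.05.2019", "02.04-31.10.2019"]})

# Build both dates from fixed character positions, fully vectorized.
df["start"] = pd.to_datetime(df["dates"].str[:5] + df["dates"].str[-5:], format="%d.%m.%Y")
df["end"] = pd.to_datetime(df["dates"].str[6:], format="%d.%m.%Y")

# Roll the start date back a year when it lands after the end date.
mask = df["start"] > df["end"]
df.loc[mask, "start"] = df.loc[mask, "start"] - pd.DateOffset(years=1)
print(df)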

How to exclude weekends and holidays from finding the difference between two dates in python

I need to find the difference between 2 dates, where certain end dates are blank. I need to exclude weekends as well as holidays when calculating the difference, and I also need to account for the blank end_dates.
I have a data frame which looks like:
start_date    end_date
01-01-2020    05-01-2020
30-10-2021    NaT
15-08-2019    NaT
29-06-2020    15-07-2020
The code I wrote for retrieving the holidays is the following:
df = pd.read_excel(r'dates.xlsx')
df.head()
us_holidays = holidays.UnitedStates()
The following code works around the null values and excludes the weekends:
def business_days(start, end):
    mask = pd.notnull(start) & pd.notnull(end)
    start = start.values.astype('datetime64[D]')[mask]
    end = end.values.astype('datetime64[D]')[mask]
    holi = us_holidays.values.astype('datetime64[D]')[mask]
    result = np.empty(len(mask), dtype=float)
    result[mask] = np.busday_count(start, end, holidays=holi)
    result[~mask] = np.nan
    return result
df['count'] = business_days(df['start_date'], df['end_date'])
The error I get is:
AttributeError: 'builtin_function_or_method' object has no attribute 'astype'
How can I fix the following error?
Any help will be greatly appreciated, thanks.
I'm not familiar with the holidays package, but holidays.UnitedStates() seems to return an object and not the needed list of dates. However, you can create a list of holiday dates for a certain range of years.
I'm not sure why you get NaT, usually you get NaN, but you can handle both.
One way to do it:
import holidays
import pandas as pd
import numpy as np
import datetime
#Create Dummy DataFrame:
df = pd.DataFrame(columns=['start_date','end_date'])
df['start_date'] = np.array(["2020-01-01","2021-10-30","2019-08-15","2020-06-29"])
df['end_date'] = np.array(["2020-01-05","NaT","NaT", "2020-07-15"])
#Convert Columns to datetime
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
#Convert DateTime to Date
df['start_date'] = df['start_date'].dt.date
df['end_date'] = df['end_date'].dt.date
#Not sure why you get NaT when you read the file with pandas. So replace it with today:
df = df.replace({'NaT': datetime.date.today()})
#In case you get a NaN:
df = df.fillna(datetime.date.today())
#Get First and Last Year
max_year = df.max().max().year
min_year = df.min().min().year
#holidays.US returns an object, so you have to create a list
us_holidays = list()
for date, name in sorted(holidays.US(years=list(range(min_year, max_year + 1))).items()):
    us_holidays.append(date)
start_dates = list(df['start_date'].values)
end_dates = list(df['end_date'].values)
df['count'] = np.busday_count(start_dates, end_dates, holidays = us_holidays)
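If you would rather keep NaN for rows whose end_date is missing (as in the original business_days function) instead of substituting today's date, the same masking idea can be combined with the holiday list; a rough sketch, assuming the holidays package's UnitedStates class and the 2019-2021 year range:
import holidays
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2020-01-01", "2021-10-30", "2019-08-15", "2020-06-29"]),
    "end_date": pd.to_datetime(["2020-01-05", None, None, "2020-07-15"]),
})

# Holiday dates covering the years present in the data (assumed range).
us_holidays = sorted(holidays.UnitedStates(years=range(2019, 2022)).keys())
holi = np.array(us_holidays, dtype="datetime64[D]")

# Only count business days where both dates are present; leave NaN elsewhere.
mask = (df["start_date"].notna() & df["end_date"].notna()).to_numpy()
result = np.full(len(df), np.nan)
result[mask] = np.busday_count(
    df.loc[mask, "start_date"].values.astype("datetime64[D]"),
    df.loc[mask, "end_date"].values.astype("datetime64[D]"),
    holidays=holi,
)
df["count"] = result
print(df)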

Convert Stacked DataFrame of Years and Months to DataFrame with Datetime Indices

I am reading a csv file of the number of employees in the US by year and month (in thousands). It starts out like this:
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1961,45119,44969,45051,44997,45119,45289,45400,45535,45591,45716,45931,46035
1962,46040,46309,46375,46679,46668,46644,46720,46775,46888,46927,46910,46901
1963,46912,47000,47077,47316,47328,47356,47461,47542,47661,47805,47771,47863
...
I want my Pandas Dataframe to have the datetime as the index for each month's value. I'm doing this so I can later add values for specific time ranges. I want it to look something like this:
1961-01-01 45119.0
1961-02-01 44969.0
1961-03-01 45051.0
1961-04-01 44997.0
1961-05-01 45119.0
...
I did some research and thought that if I stacked the years and months together, I could combine them into a datetime. Here is what I have done:
import pandas as pd
import numpy as np
df = pd.read_csv("BLS_private.csv", header=5, index_col="Year")
df.columns = range(1, 13) # I transformed months into numbers 1-12 for easier datetime conversion
df = df.stack() # Months are no longer columns
print(df)
Here is my output:
Year
1961 1 45119.0
2 44969.0
3 45051.0
4 44997.0
5 45119.0
...
I do not know how to combine the year and the months in the stacked indices. Does stacking the indices help at all in my case? I am also not very familiar with Pandas datetime, so any explanation of how I could use it would be very helpful. And if anyone has alternative solutions to making datetime the index, I welcome ideas.
After the stack, create the DatetimeIndex from the current index:
from datetime import datetime
dt_index = pd.to_datetime([datetime(year=year, month=month, day=1)
                           for year, month in df.index.values])
df.index = dt_index
df.head(3)
# 1961-01-01 45119
# 1961-02-01 44969
# 1961-03-01 45051
import pandas as pd
df = pd.read_csv("BLS_private.csv", index_col="Year")
dates = pd.date_range(start=str(df.index[0]), end=str(df.index[-1] + 1), closed='left', freq="MS")
df = df.stack()
df.index = dates
df.to_frame()
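Note that in more recent pandas versions the closed keyword of date_range has been replaced by inclusive, so the equivalent call would likely be:
dates = pd.date_range(start=str(df.index[0]), end=str(df.index[-1] + 1), inclusive='left', freq="MS")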
s = """Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1961,45119,44969,45051,44997,45119,45289,45400,45535,45591,45716,45931,46035
1962,46040,46309,46375,46679,46668,46644,46720,46775,46888,46927,46910,46901
1963,46912,47000,47077,47316,47328,47356,47461,47542,47661,47805,47771,47863"""
df = pd.read_csv(StringIO(s))
# set index and stack
stack = df.set_index('Year').stack().reset_index()
# create a new index
stack.index = pd.to_datetime(stack['Year'].astype(str) +'-'+ stack['level_1'])
# remove columns
final = stack[0].to_frame()
1961-01-01 45119
1961-02-01 44969
1961-03-01 45051
1961-04-01 44997
1961-05-01 45119
1961-06-01 45289
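Another option, if you prefer to avoid stack entirely, is to melt the wide table and parse the year plus month abbreviation directly; a sketch on a shortened version of the same sample data (column names here are illustrative):
from io import StringIO
import pandas as pd

s = """Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1961,45119,44969,45051,44997,45119,45289,45400,45535,45591,45716,45931,46035
1962,46040,46309,46375,46679,46668,46644,46720,46775,46888,46927,46910,46901"""
df = pd.read_csv(StringIO(s))

# Wide -> long, then parse "1961-Jan" style strings into a DatetimeIndex.
long_df = df.melt(id_vars="Year", var_name="Month", value_name="Employees")
long_df.index = pd.to_datetime(long_df["Year"].astype(str) + "-" + long_df["Month"], format="%Y-%b")
long_df = long_df[["Employees"]].sort_index()
print(long_df.head())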

difference between 2 dates in days saved into new column as float

I have a pandas dataframe with 2 columns that contain dates as strings, e.g. start_date "2002-06-12" and end_date "2009-03-01". I would like to calculate the difference in days between these 2 columns for each row and save the results into a new column called, for example, time_diff of type float.
I have tried:
df["time_diff"] = (pd.Timestamp(df.end_date) - pd.Timestamp(df.start_date )).astype("timedelta64[d]")
pd.to_numeric(df["time_diff"])
based on some tutorials, but this gives TypeError: Cannot convert input for the first line. What do I need to change to get this running?
Here is a working example that converts the string columns of a dataframe to datetime and saves the time difference between them in a new column as a float (number of seconds):
import pandas as pd
from datetime import timedelta
tmp = [("2002-06-12", "2009-03-01"), ("2016-04-28", "2022-03-14")]
df = pd.DataFrame(tmp, columns=["col1", "col2"])
df["col1"] = pd.to_datetime(df["col1"])
df["col2"] = pd.to_datetime(df["col2"])
df["time_diff"] = df["col2"] - df["col1"]
df["time_diff"] = df["time_diff"].apply(timedelta.total_seconds)
Time difference in seconds can be converted to minutes or days by using simple math.
Try:
import numpy as np
enddates = np.asarray([pd.Timestamp(end) for end in df.end_date.values])
startdates = np.asarray([pd.Timestamp(start) for start in df.start_date.values])
df['time_diff'] = (enddates - startdates).astype("timedelta64[D]")
First convert strings to datetime, then calculate difference in days.
df['start_date'] = pd.to_datetime(df['start_date'], format='%Y-%m-%d')
df['end_date'] = pd.to_datetime(df['end_date'], format='%Y-%m-%d')
df['time_diff'] = (df.end_date - df.start_date).dt.days
You can also do it by converting your columns into dates and then computing the difference:
from datetime import datetime
df = pd.DataFrame({'Start Date' : ['2002-06-12', '2002-06-12' ], 'End date' : ['2009-03-01', '2009-03-06']})
df['Start Date'] = [ datetime.strptime(x, "%Y-%m-%d").date() for x in df['Start Date'] ]
df['End date'] = [ datetime.strptime(x, "%Y-%m-%d").date() for x in df['End date'] ]
df['Diff'] = df['End date'] - df['Start Date']
Out :
End date Start Date Diff
0 2009-03-01 2002-06-12 2454 days
1 2009-03-06 2002-06-12 2459 days
You should just use pd.to_datetime to convert your string values:
df["time_diff"] = (pd.to_datetime(df.end_date) - pd.to_datetime(df.start_date))
The result will automatically be a timedelta64; you can then use .dt.days if you need the number of days.
You can try this :
df = pd.DataFrame()
df['Arrived'] = [pd.Timestamp('01-04-2017')]
df['Left'] = [pd.Timestamp('01-06-2017')]
diff = df['Left'] - df['Arrived']
days = pd.Series(delta.days for delta in diff)
result = days[0]
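Tying it back to the original question, a minimal sketch that produces the difference in days as a float column, using the question's column names:
import pandas as pd

df = pd.DataFrame({"start_date": ["2002-06-12", "2016-04-28"],
                   "end_date": ["2009-03-01", "2022-03-14"]})

df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])

# .dt.days gives integer days; cast to float as requested.
df["time_diff"] = (df["end_date"] - df["start_date"]).dt.days.astype(float)
print(df)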
