How do I obtain the result below?
Current Month is the column to be calculated. We need to get the increment every month, starting from Jan-18, for every account id.
Every account's first row/record starts at Jan-18, the second row is Feb-18, and so on. We need to increment from Jan-18 until the last observation for that account id.
The above is shown for a sample account; the same has to be applied for multiple account ids.
You could achieve what you are looking for as follows:
import pandas as pd
from datetime import date

acct_id = "123456789"
loan_start_date = date(2018, 1, 31)
current_date = date.today()

# month-end labels from the loan start date up to today, e.g. "Jan-18", "Feb-18", ...
dates = pd.date_range(loan_start_date, current_date, freq='M').strftime("%b-%y")

df = pd.DataFrame()
df["current_month"] = dates
df["acct_id"] = acct_id
df["loan_start_date"] = loan_start_date
df = df[["acct_id", "loan_start_date", "current_month"]]
print(df.head())
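To cover multiple account ids, the same idea extends by building one frame per account and concatenating them. A minimal sketch, assuming a hypothetical dict mapping account ids to their loan start dates (with a fixed end date for reproducibility; swap in date.today() for live data):

```python
import pandas as pd
from datetime import date

# hypothetical account ids and loan start dates, assumed for illustration
accounts = {
    "123456789": date(2018, 1, 31),
    "987654321": date(2018, 3, 31),
}
end = date(2018, 6, 30)

frames = []
for acct_id, start in accounts.items():
    # one monthly label per period from this account's start to the end date
    months = pd.period_range(start, end, freq="M").strftime("%b-%y")
    frames.append(pd.DataFrame({
        "acct_id": acct_id,
        "loan_start_date": start,
        "current_month": months,
    }))

df = pd.concat(frames, ignore_index=True)
print(df)
```

Each account's rows begin at its own start month, so accounts with different start dates get different numbers of rows.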
I have two dataframes. One contains a column with the dates of earnings for a stock. The other contains all the prices for the stock; note that its index is the date. I want to get the prices of the stock N days before and after earnings and store them column-wise in a new dataframe. This is what I have so far:
earningsPrices = pd.DataFrame()
for date in dates:
    earningsPrices[date] = prices[date - pd.Timedelta(days=N):date + pd.Timedelta(days=N)]
print(earningsPrices)
and this is the output
The problem is that it only writes the prices for the first date, and not the rest.
You could take this approach:
earningsPrices = pd.DataFrame(index=dates, columns=['price1', 'price2', 'price3'])
for date in dates:
    start_date = date - pd.Timedelta(days=N)
    end_date = date + pd.Timedelta(days=N)
    # rows whose date falls inside the window around this earnings date
    selected_rows = prices.loc[prices['date_column'].between(start_date, end_date)]
    earningsPrices.loc[date, 'price1'] = selected_rows['price1'].values
    earningsPrices.loc[date, 'price2'] = selected_rows['price2'].values
    earningsPrices.loc[date, 'price3'] = selected_rows['price3'].values
print(earningsPrices)
Alternatively, use concat:
earningsPrices = pd.DataFrame()
for date in dates:
    # slice the window around each earnings date, reset the index so windows align row-wise
    earningsPeriod = prices[date - pd.Timedelta(days=window):date + pd.Timedelta(days=window)].reset_index(drop=True)
    earningsPrices = pd.concat([earningsPrices, earningsPeriod], axis=1)
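Here is a self-contained sketch of the concat approach on simulated data; the price series, earnings dates, and window size are assumptions, not the asker's actual data:

```python
import pandas as pd
import numpy as np

# simulated daily prices indexed by date, a stand-in for the real price data
idx = pd.date_range("2018-06-01", "2018-07-31")
prices = pd.Series(np.arange(len(idx), dtype=float), index=idx)

dates = pd.to_datetime(["2018-06-15", "2018-07-10"])  # hypothetical earnings dates
window = 2  # days before and after each earnings date

earningsPrices = pd.DataFrame()
for date in dates:
    # label-based slice is inclusive on both ends: 2*window + 1 rows per date
    period = prices[date - pd.Timedelta(days=window):date + pd.Timedelta(days=window)]
    earningsPrices[date.strftime("%Y-%m-%d")] = period.reset_index(drop=True)

print(earningsPrices)  # one column per earnings date
```

Resetting the index before assignment is what makes each window land in rows 0..2*window instead of being aligned (and mostly lost) against the dates of the first window.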
Here is the code for sample simulated data. Actual data can have varying start and end dates.
import pandas as pd
import numpy as np
dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
dfb = df.resample('B').apply(lambda x: x[-1])
From dfb, I want to select only the rows belonging to months for which values exist for every day of the month.
In dfb, 2010 January and 2020 January have incomplete data. So I would like data from 2010 Feb till 2019 December.
For this particular dataset, I could do
df_out=dfb['2010-02':'2019-12']
But please help me with a better solution.
Edit: it seems there is plenty of confusion in the question. I want to omit the rows of any partial month, i.e. a leading month that does not begin on its first day and a trailing month that does not end on its last day. Hope that's clear.
When you say "better" solution, I assume you mean making the range dynamic based on the input data.
OK, since you mention that your data is continuous after the start date, it is safe to assume the dates are sorted in increasing order. With this in mind, consider the code:
import pandas as pd
import numpy as np
from datetime import date, timedelta
dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
print(df)
dfb = df.resample('B').apply(lambda x: x.iloc[-1])
# fd is the first index in your dataframe
fd = df.index[0]
# if the first month's data is incomplete (it does not start on day 1),
# move the start forward to the first day of the next month
if fd.day != 1:
    if fd.month == 12:
        first_day_of_next_month = fd.replace(year=fd.year + 1, month=1, day=1)
    else:
        first_day_of_next_month = fd.replace(month=fd.month + 1, day=1)
else:
    first_day_of_next_month = fd
# ld is the last index in your dataframe
ld = df.index[-1]
# if the day after ld falls in a different month, ld already is a month end;
# otherwise step back to the last day of the previous month
next_day = ld + timedelta(days=1)
if next_day.month != ld.month:
    last_day_of_prev_month = ld
else:
    last_day_of_prev_month = ld.replace(day=1) - timedelta(days=1)
df_out = dfb[first_day_of_next_month:last_day_of_prev_month]
Another way is to use dateutil.relativedelta, but you will need the python-dateutil module installed. The above solution does it without any extra modules.
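For completeness, a sketch of the dateutil.relativedelta variant (python-dateutil is a pandas dependency, so it is usually already available); the simulated data follows the question:

```python
import pandas as pd
import numpy as np
from dateutil.relativedelta import relativedelta

dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
dfb = df.resample('B').last()  # last value per business day

fd, ld = df.index[0], df.index[-1]
# first full month: first day of the next month unless fd is already day 1
start = fd if fd.day == 1 else fd + relativedelta(months=1, day=1)
# last full month: ld itself if it is a month end, else the day before ld's month began
if (ld + pd.Timedelta(days=1)).month != ld.month:
    end = ld
else:
    end = ld + relativedelta(day=1) - pd.Timedelta(days=1)
df_out = dfb[start:end]
print(df_out.index[0], df_out.index[-1])
```

relativedelta handles the December/year rollover for you, which is the edge case the manual replace() arithmetic has to treat specially.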
I assume that in the general case the table is chronologically ordered (if not, use .sort_index). The idea is to extract the year and month from the date and select only the rows whose (year, month) differs from that of the first and last rows.
dfb['year'] = dfb.index.year    # positional column 1
dfb['month'] = dfb.index.month  # positional column 2
first_month = (dfb['year'] == dfb.iloc[0, 1]) & (dfb['month'] == dfb.iloc[0, 2])
last_month = (dfb['year'] == dfb.iloc[-1, 1]) & (dfb['month'] == dfb.iloc[-1, 2])
dfb = dfb.loc[(~first_month) & (~last_month)]
dfb = dfb.drop(['year', 'month'], axis=1)
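The same idea can be written without helper columns by comparing monthly periods; a sketch on the question's simulated data:

```python
import pandas as pd
import numpy as np

dates = pd.date_range("20100121", periods=3653)
dfb = (pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
         .resample('B').last())

# monthly period of each row; drop every row in the (possibly partial) first and last months
p = dfb.index.to_period('M')
out = dfb[(p != p[0]) & (p != p[-1])]
print(out.index[0], out.index[-1])
```

to_period('M') collapses each timestamp to its calendar month, so the comparison against the first and last periods needs no extra columns and no cleanup afterwards.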
I have a csv file with two datetimes (pre-start and pre-end) in each row, as well as a list of datetimes (install_list).
I am trying to iterate through the csv file and add a column that returns the total number of dates from the install_list that are between the pre-start time and the pre-end time in each row.
I am using the code below, but it is returning the total number of items in the list for each row in the csv.
example: File 1 = start time, end time
List 1 = install time
Desired Result for Each Row = IF Install Time >= Start Time AND Install Time <= End Time, SUM(Installs)
Col1 (Start Time): 1/1/21 12:00:00 PM
Col2 (End Time): 1/1/21 12:10:00 PM
Install Time List = [1/1/21 12:05:00 PM, 1/1/21 12:11:00 PM]
Desired Result for Row1/Col3 = 1
Code Below:
import datetime
import pandas as pd
from collections import Counter

df_post_logs = pd.read_csv('logs_merged.csv', index_col=False)
df_installs = pd.read_csv('install_merge.csv', index_col=False)

'''Convert UTC to EST on Installs Add Column'''
df_installs['conversion date'] = pd.to_datetime(df_installs['conversion date'], infer_datetime_format='%Y-%m-%d')
df_installs['conversion time'] = pd.to_datetime(df_installs['conversion time'], infer_datetime_format='%H:%S:%M')
utc_datetime = df_installs['conversion time']
est_datetime = utc_datetime - datetime.timedelta(hours=5)
df_installs['utc datetime'] = utc_datetime
df_installs['est datetime'] = est_datetime

'''Add Column 10 Minutes Pre-Spot Time to Post Logs/10 Minutes Post Time to Spot'''
df_post_logs['Air Date'] = pd.to_datetime(df_post_logs['Air Date'], infer_datetime_format='%Y-%m-%d')
df_post_logs['Air Time'] = pd.to_datetime(df_post_logs['Air Time'], infer_datetime_format='%H:%S:%M')
timestamp = df_post_logs['Air Time']
df_post_logs['timestamp'] = timestamp
df_post_logs['pre spot time start'] = timestamp - datetime.timedelta(minutes=10, seconds=1)
df_post_logs['pre spot time end'] = timestamp - datetime.timedelta(seconds=1)
df_post_logs['post spot time'] = timestamp + datetime.timedelta(minutes=10)

'''SUM of Installs between pre-spot time'''
install_list = pd.to_datetime(df_installs['est datetime']).to_list()
for pre_spot_start in df_post_logs['pre spot time start']:
    pre_spot_start_time = pre_spot_start
for pre_spot_end in df_post_logs['pre spot time end']:
    pre_spot_end_time = pre_spot_end
pre_spot_install = 0
for row in df_post_logs:
    for date in install_list:
        if date >= pre_spot_start_time and date <= pre_spot_end_time:
            pre_spot_install = pre_spot_install + 1
df_post_logs['Pre Spot Install'] = pre_spot_install
df_post_logs.to_csv('Test.csv')
The following code will print for each row, how many values in install_dates are between the respective values in the start and end columns of the dataframe:
import pandas as pd
df = pd.DataFrame({
"start": pd.to_datetime(["2018-07-11", "2018-06-10"]),
"end": pd.to_datetime(["2018-07-20", "2018-06-30"]),
})
install_dates = pd.to_datetime(["2018-06-25", "2018-07-01", "2018-07-15", "2018-07-18"])
def num_install_dates_between_start_and_end(row):
return len([d for d in install_dates if row["start"] <= d <= row["end"]])
print(df.agg(num_install_dates_between_start_and_end, axis="columns"))
It uses agg to collapse the information of each row to one number. How the information is "collapsed" is specified in num_install_dates_between_start_and_end, which counts how many elements of install_dates fall between the row's start and end values.
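If install_dates grows large, the per-row list scan can be replaced by a vectorized count with searchsorted on the sorted dates; a sketch on the same sample data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "start": pd.to_datetime(["2018-07-11", "2018-06-10"]),
    "end": pd.to_datetime(["2018-07-20", "2018-06-30"]),
})
install_dates = pd.to_datetime(
    ["2018-06-25", "2018-07-01", "2018-07-15", "2018-07-18"]
).sort_values()

# per row: (# install dates <= end) minus (# install dates < start)
counts = (install_dates.searchsorted(df["end"], side="right")
          - install_dates.searchsorted(df["start"], side="left"))
df["installs"] = counts
print(df)
```

This does two binary searches per row instead of scanning the whole list, so it scales as O(rows * log(dates)) rather than O(rows * dates).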
The problem is I can't think of a way to get the 'Mark' from the last day of the previous month, because I need to compare the current month with the previous month.
It's a number generated every day. Mark_LastDayData takes as reference the mark of the last day of the month and replicates it in all values of that same month. 'Mark_LastDayDate_PreviousMonth' would be like getting the 'Mark_LastDayData' from the previous month, so I can make a comparison in the future.
I have the following df:
import pandas as pd
from pandas.tseries.offsets import BMonthEnd
import datetime as dt

df = pd.DataFrame({'Found': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
                   'Date': ['14/10/2021', '19/10/2021', '29/10/2021', '30/09/2021', '20/09/2021', '20/10/2021', '29/10/2021', '15/10/2021'],
                   # 'LastDayMonth': ['29/10/2021', '29/10/2021', '29/10/2021', '30/09/2021', '30/09/2021', '29/10/2021', '29/10/2021', '29/10/2021'],
                   'Mark': [1, 2, 3, 4, 3, 1, 2, 3]})
print(df)
LastDayMonth was obtained through the code below. I made some changes to the dates:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['LastDayDate'] = df['Date'] + BMonthEnd(0)
df['LastDayDatePrevMonth'] = df['Date'] - pd.DateOffset(months=1)
I needed the 'Mark' of the last day of the month of each date, so I used:
df = df.merge(df.loc[df['Date'] == df['LastDayDate'], ['Found', 'LastDayDate', 'Mark']],
              on=['Found', 'LastDayDate'],
              how='left', suffixes=['', '_LastDayDate'])
How can I do the same to get the 'Mark' from the last day of the previous month, in the same column?
Sample df that I filled in manually
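One way to extend the same merge to the previous month (a sketch; the column name Mark_LastDayDate_PreviousMonth is an assumption): compute the last business day of the previous month with BMonthEnd(1), then merge against the rows that fall on their own month end.

```python
import pandas as pd
from pandas.tseries.offsets import BMonthEnd

df = pd.DataFrame({'Found': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
                   'Date': ['14/10/2021', '19/10/2021', '29/10/2021', '30/09/2021',
                            '20/09/2021', '20/10/2021', '29/10/2021', '15/10/2021'],
                   'Mark': [1, 2, 3, 4, 3, 1, 2, 3]})
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['LastDayDate'] = df['Date'] + BMonthEnd(0)
# last business day of the previous month (rolls back even from a month-end date)
df['LastDayDatePrevMonth'] = df['Date'] - BMonthEnd(1)

# marks observed on a month-end date, keyed so they join on the previous-month column
marks = (df.loc[df['Date'] == df['LastDayDate'], ['Found', 'Date', 'Mark']]
           .rename(columns={'Date': 'LastDayDatePrevMonth',
                            'Mark': 'Mark_LastDayDate_PreviousMonth'}))
df = df.merge(marks, on=['Found', 'LastDayDatePrevMonth'], how='left')
print(df)
```

Rows whose previous month has no observation on its last business day (here, the September rows and all of group B) come out as NaN, which is the expected signal that no comparison is possible.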
I have a df as follows:
Date values
20190101000000 1384.4801224435887
20190101000001 1384.5053056232982
20190101000002 1384.5304889818935
20190101000003 1384.5556725193492
20190101000004 1384.5808562356392
20190101000005 1384.606040130739
20190101000006 1384.631224204622
20190101000007 1384.6564084572635
20190101000008 1384.6815928886372
20190101000009 1384.7067774987179
20190101000010 1384.7319622874802
20190101000011 1384.757147254898
20190101000012 1384.7823324009464
20190101000013 1384.8075177255998
20190101000014 1384.8327032288325
20190101000015 1384.8578889106184
20190101000016 1384.8830747709321
20190101000017 1384.9082608097488
20190101000018 1384.9334470270423
20190101000019 1384.958633422787
20190101000020 1384.9838199969574
20190101000021 1385.0090067495285
20190101000022 1385.034193680474
20190101000023 1385.0593807897685
20190101000024 1385.0845680773864
20190101000025 1385.1097555433028
20190101000026 1385.134943187491
20190101000027 1385.160131009926
20190101000028 1385.1853190105826
20190101000029 1385.2105071894343
20190101000030 1385.2356955464566
where the Date column has the format %Y%m%d%H%M%S. I take a start date and an end date as user inputs and split the range at a frequency of 1 second.
Now, I would like to take a resolution in seconds from the user and obtain the value from the values column at each such instant.
Example:
If the second resolution is 10secs, then the output must be as follows:
start end value
20190101000000 20190101000010 1384.7319622874802
20190101000011 20190101000020 1384.9838199969574
20190101000021 20190101000030 1385.2356955464566
From the above df we can see that if the resolution is 10 sec, then the value at every 10th second must be obtained.
If the second resolution is 15mins, then the output must be as follows:
start end values
20190101000000 20190101001500 1407.2142300429964
20190101001501 20190101003000 1416.6996533329484
20190101003001 20190101004500 1424.2467631293005
How can this be done?
My code till now:
import datetime
import pandas as pd

START_DATE = input('Enter start date in %Y-%m-%d %H:%M:%S format: ')
END_DATE = input('Enter end date in %Y-%m-%d %H:%M:%S format: ')
RESOLUTION = 'S'
dates = pd.date_range(START_DATE, END_DATE, freq=RESOLUTION)
dates = pd.DataFrame(pd.Series(dates).dt.strftime('%Y%m%d%H%M%S'), columns=['Date'])
Compare the datetimes converted to their underlying integer representation, modulo the timedelta; then create a new column with DataFrame.insert and Series.shift, and finally remove the first row with iloc:
import numpy as np

res = '10s'
m = pd.to_datetime(df['Date']).to_numpy().astype(np.int64) % pd.Timedelta(res).value == 0
df = df[m].rename(columns={'Date':'end'})
df.insert(0, 'start', df['end'].shift())
df = df.iloc[1:]
print (df)
start end values
10 20190101000000 20190101000010 1384.7319622874802
20 20190101000010 20190101000020 1384.9838199969574
30 20190101000020 20190101000030 1385.2356955464566
Finally, to add the 1 second, use:
df.loc[df.index[1:], 'start'] = (pd.to_datetime(df.loc[df.index[1:], 'start']) +
pd.Timedelta('1s')).dt.strftime('%Y%m%d%H%M%S')
print (df)
start end values
10 20190101000000 20190101000010 1384.7319622874802
20 20190101000011 20190101000020 1384.9838199969574
30 20190101000021 20190101000030 1385.2356955464566
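Alternatively, if the Date column is converted to a real DatetimeIndex, asfreq can pick the value at every resolution tick directly; a sketch on simulated one-second data (the values are synthetic, not the question's):

```python
import pandas as pd
import numpy as np

# 31 seconds of synthetic data in the question's string format
idx = pd.date_range("2019-01-01 00:00:00", periods=31, freq="s")
df = pd.DataFrame({"Date": idx.strftime("%Y%m%d%H%M%S"),
                   "values": np.arange(31, dtype=float)})

res = "10s"
ts = pd.to_datetime(df["Date"], format="%Y%m%d%H%M%S")
# asfreq keeps only the rows at exact 10-second ticks; drop the starting tick
out = df.set_index(ts)["values"].asfreq(res).iloc[1:]
print(out)
```

The same res string works for any resolution the user supplies ("15min", "1h", ...), since asfreq accepts any pandas frequency alias.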
You have to change the data type of the dates:
import pandas as pd
start_date = pd.to_datetime(START_DATE)
end_date = pd.to_datetime(END_DATE)
Resolution = start_date.minute