dataframe groupby date and resample by seconds for daily change price - python

I have data like below.
I need to get the percentage change of each 10-second resample's last price compared to the daily open price (the 00:00:00 price), like below.
There is more than one compid.
I did something like below, but the df_price_curr_last line raises an error.
df_t is the data:
group = ['compid', df_t['datetime'].date]
df_price_open = df_t.groupby(group)['price'].first().to_frame()
df_price_open
df_price_curr_last = df_t.groupby(group).resample('10S')['price'].last()
df_price_curr_last/df_price_open
Below is the error msg.
ValueError: Key 2020-11-06 00:00:00 not in level Index([2020-11-06, 2020-11-07], dtype='object')

I think you can group by dates and also by a Grouper with 10S, aggregate last, and then group by the first and second levels (compid and date) with GroupBy.transform to repeat the first value, so it is possible to divide both Series:
grouper = ['compid',
           df_t['datetime'].dt.date.rename('date'),
           pd.Grouper(freq='10S', key='datetime')]

# last price within each 10-second bucket, per compid and calendar day
df_price_curr_last = df_t.groupby(grouper)['price'].last()
print(df_price_curr_last)

# daily open = the first bucket's value, repeated across each (compid, date) group
df_price_open = df_price_curr_last.groupby(level=[0, 1]).transform('first')
a = df_price_curr_last / df_price_open
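For a quick check, here is a self-contained version of the snippet above run against a small synthetic df_t (compid, datetime, price and the 2020-11-06 date come from the question; the prices are made up):
import numpy as np
import pandas as pd

# two companies, one day, one row every 5 seconds
rng = pd.date_range('2020-11-06 00:00:00', periods=6, freq='5S')
df_t = pd.DataFrame({
    'compid': np.repeat(['A', 'B'], len(rng)),
    'datetime': np.tile(rng, 2),
    'price': [100.0, 101, 102, 103, 104, 105, 50.0, 51, 52, 53, 54, 55],
})

grouper = ['compid',
           df_t['datetime'].dt.date.rename('date'),
           pd.Grouper(freq='10S', key='datetime')]

df_price_curr_last = df_t.groupby(grouper)['price'].last()
df_price_open = df_price_curr_last.groupby(level=[0, 1]).transform('first')

# ratio of each bucket's last price to its daily open; subtract 1 for the change
print(df_price_curr_last / df_price_open - 1)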

Related

Using a Rolling Function in Pandas based on Date and a Categorical Column

I'm currently working on a dataset where I am using the rolling function in pandas to create features.
The functions rely on three columns: a numeric DaysLate column from which the mean is calculated, an InvoiceDate column from which the date is derived, and a customerID column which denotes the customer of a row.
I'm trying to get a rolling mean of DaysLate for the last 30 days, limited to invoices raised to a specific customerID.
The following two functions are working.
Mean of DaysLate for the last five invoices raised for the row's customer
df["CustomerDaysLate_lastfiveinvoices"] = df.groupby("customerID").rolling(window = 5,min_periods = 1).\
DaysLate.mean().reset_index().set_index("level_1").\
sort_index()["DaysLate"]
Mean of DaysLate for all invoices raised in the last 30 days
df = df.sort_values('InvoiceDate')
df["GlobalDaysLate_30days"] = df.rolling(window = '30d', on = "InvoiceDate").DaysLate.mean()
I just can't seem to find the code to get the mean of the last 30 days by customerID. Any help on the above is greatly appreciated.
Set the date column as the index, sort it to ensure ascending order, then group the sorted dataframe by customer id and calculate the 30d rolling mean for each group.
mean_30d = (
    df
    .set_index('InvoiceDate')  # !important
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)
# merge the rolling mean back into the original dataframe
result = df.merge(mean_30d)
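A small, self-contained sketch of the same approach (customerID, InvoiceDate and DaysLate come from the question; the toy values, the output column name and the explicit merge keys are my own):
import pandas as pd

df = pd.DataFrame({
    'customerID': ['a', 'a', 'a', 'b', 'b'],
    'InvoiceDate': pd.to_datetime(
        ['2021-01-01', '2021-01-10', '2021-03-01', '2021-01-05', '2021-01-20']),
    'DaysLate': [2, 4, 6, 1, 3],
})

mean_30d = (
    df
    .set_index('InvoiceDate')      # the time-based window needs a datetime index
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='CustomerDaysLate_30days')
)

# merge back on customerID and InvoiceDate; this assumes the pair uniquely
# identifies a row
result = df.merge(mean_30d, on=['customerID', 'InvoiceDate'], how='left')
print(result)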

How to search for a specific date within concatenated DataFrame TimeSeries. Same Date would repeat several times in a merged df

I downloaded historical price data for ^GSPC Share Market Index (S&P500), and several other Global Indices. Date is set as index.
Selecting values in rows when date is set to index works as expected with .loc.
# S&P500 DataFrame = spx_df
spx_df.loc['2010-01-04']
Open 1.116560e+03
High 1.133870e+03
Low 1.116560e+03
Close 1.132990e+03
Volume 3.991400e+09
Dividends 0.000000e+00
Stock Splits 0.000000e+00
Name: 2010-01-04 00:00:00-05:00, dtype: float64
I then concatenated several Stock Market Global Indices into a single DataFrame for further use. In effect, any date in range will be included five times when historical data for five Stock Indices are linked in a Time Series.
markets = pd.concat(ticker_list, axis = 0)
I want to reference a single date in the concatenated df and set it as a variable. I would prefer that the variable didn't represent a datetime object, because I would like to access it with .loc as part of a def function. How does concatenation affect accessing rows via the date index if the same date repeats several times in a linked TimeSeries?
This is what I attempted so far:
# markets = concatenated DataFrame
Reference_date = markets.loc['2010-01-04']
# KeyError: '2010-01-04'
Reference_date = markets.loc[markets.Date == '2010-01-04']
# This doesn't work because Date is not an attribute of the DataFrame
Since you have set the date as the index, you should be able to do:
Reference_date = markets.loc[markets.index == '2010-01-04']
To access a specific date in the concatenated DataFrame, you can use boolean indexing instead of .loc. This will return a DataFrame that contains all rows where the date equals the reference date:
reference_date = markets[markets.index == '2010-01-04']
You may also want to use the query() method to search for specific dates:
reference_date = markets.query('index == "2010-01-04"')
Keep in mind that the resulting variable reference_date is still a DataFrame and contains all rows that match the reference date across all the concatenated DataFrames. If you want to extract only specific columns, you can use the column name like this:
reference_date_Open = markets.query('index == "2010-01-04"')["Open"]
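For illustration, a minimal sketch with two tiny synthetic index frames (the numbers are made up; only the structure, with the date as index, mirrors the question):
import pandas as pd

idx = pd.to_datetime(['2010-01-04', '2010-01-05'])
spx = pd.DataFrame({'Open': [1116.6, 1132.7], 'Close': [1133.0, 1136.5]}, index=idx)
dax = pd.DataFrame({'Open': [6048.3, 6031.9], 'Close': [6031.9, 6034.3]}, index=idx)

# after concatenation the same date appears once per original index
markets = pd.concat([spx, dax], axis=0)

# boolean indexing returns every row whose index equals the reference date
reference_date = markets[markets.index == '2010-01-04']
print(reference_date)

# the equivalent query(); 'index' here refers to the DataFrame index
print(markets.query('index == "2010-01-04"')['Open'])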

Python how to auto pick last Trade Day closing price

Hi, I have created a dataframe with Actual Close, High, Low, and now I have to calculate the Day-Change, 3Days-Change and 2Weeks-Change for each row.
With the code below, the Day-Change field comes out blank/NaN (the 10/27/2009 D-Chg field). How can I get Python to automatically pick the last trading date's (10/23/2009) AC price for the calculation when the shifted date doesn't exist?
data["D-Chg"]=stock_store['Adj Close'] - stock_store['Adj Close'].shift(1, freq='B')
Thanks with Regards
Format your first column to datetime:
data['Mycol'] = pd.to_datetime(data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
Get the max value:
last_date = data['Mycol'].max()
Get the most up-to-date row:
is_last = data['Mycol'] == last_date
data[is_last]
This may be done in one step if you give your desired column to max().
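For example, a self-contained sketch of the one-step version (the column names and the price values here are illustrative; the dates echo the question):
import pandas as pd

data = pd.DataFrame({
    'Mycol': pd.to_datetime(['2009-10-22', '2009-10-23', '2009-10-27']),
    'Adj Close': [108.0, 109.5, 110.2],
})

# one step: keep the row(s) whose date equals the maximum date
print(data[data['Mycol'] == data['Mycol'].max()])

# or, if each date occurs exactly once, pick that row directly
print(data.loc[data['Mycol'].idxmax()])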

Front fill pandas DataFrame conditional on time

I have the following daily dataframe:
daily_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='D')
random_values = np.random.randint(1, 3,size=(len(daily_index), 1))
daily_df = pd.DataFrame(random_values, index=daily_index, columns=['A']).replace(1, np.nan)
I want to map each value to a dataframe where each day is expanded to multiple 1 minute intervals. The final DF looks like so:
intraday_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='1min')
intraday_df_full = daily_df.reindex(intraday_index)
# Choose random indices.
drop_indices = np.random.choice(intraday_df_full.index, 5000, replace=False)
intraday_df = intraday_df_full.drop(drop_indices)
In the final dataframe, each day is broken into 1 min intervals, but some are missing (so the minute count on each day is not the same). Some days have a value in the beginning of the day, but nan for the rest.
My question is, only for the days which start with some value in the first minute, how do I front fill for the rest of the day?
I initially tried to simply do daily_df.reindex(intraday_index, method='ffill', limit=1440), but since some rows are missing, this cannot work. Maybe there is a way to limit the fill by time?
Following @Datanovice's comments, this line achieves the desired result:
intraday_df.groupby(intraday_df.index.date).transform('ffill')
where the groupby defines the groups on which we want to apply the operation, and transform does this without modifying the DataFrame's index.
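A small reproducible sketch of the same idea (the fixed seed and the shorter date range are my own additions, and the random row-dropping step is omitted; otherwise it mirrors the setup above):
import numpy as np
import pandas as pd

np.random.seed(0)
daily_index = pd.date_range('2015-01-01', '2015-01-05', freq='D')
daily_df = pd.DataFrame(
    np.random.randint(1, 3, size=(len(daily_index), 1)),
    index=daily_index, columns=['A']).replace(1, np.nan)

intraday_index = pd.date_range('2015-01-01', '2015-01-05', freq='1min')
intraday_df = daily_df.reindex(intraday_index)

# forward-fill within each calendar day only, so values never bleed across
# midnight; days whose 00:00 value is NaN stay NaN for the whole day
filled = intraday_df.groupby(intraday_df.index.date).transform('ffill')
print(filled.groupby(filled.index.date)['A'].count())  # filled minutes per day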

Find Maximum Date within Date Range without filtering in Python

I have a file with one row per EMID per Effective Date. I need to find the maximum Effective date per EMID that occurred before a specific date. For instance, if EMID =1 has 4 rows, one for 1/1/16, one for 10/1/16, one for 12/1/16, and one for 12/2/17, and I choose the date 1/1/17 as my specific date, I'd want to know that 12/1/16 is the maximum date for EMID=1 that occurred before 1/1/17.
I know how to find the maximum date overall by EMID (groupby.max()). I also can filter the file to just dates before 1/1/17 and find the max of the remaining rows. However, ultimately I need the last row before 1/1/17, and then all the rows following 1/1/17, so filtering out the rows that occur after the date isn't optimal, because then I have to do complicated joins to get them back in.
# Create dummy data
dummy = pd.DataFrame(columns=['EmID', 'EffectiveDate'])
dummy['EmID'] = [random.randint(1, 10000) for x in range(49999)]
dummy['EffectiveDate'] = [np.random.choice(pd.date_range(datetime.datetime(2016,1,1), datetime.datetime(2018,1,3))) for i in range(49999)]
#Create group by
g = dummy.groupby('EmID')['EffectiveDate']
# This doesn't work, but effectively shows what I'm trying to do
dummy['max_prestart'] = max(dt for dt in g if dt < datetime(2017,1,1))
I expect that output to be an additional column in my dataframe that has the maximum date that occurred before the specified date.
Using map after selecting the rows before the cutoff:
s = dummy.loc[dummy.EffectiveDate < '2017-01-01'].groupby('EmID').EffectiveDate.max()
dummy['new'] = dummy.EmID.map(s)
Or using transform, falling back to the row's own EffectiveDate where no pre-cutoff date exists:
dummy['new'] = dummy.loc[dummy.EffectiveDate < '2017-01-01'].groupby('EmID').EffectiveDate.transform('max')
dummy['new'] = dummy['new'].fillna(dummy.EffectiveDate)
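Putting both variants together in a compact, self-contained sketch (EmID, EffectiveDate, the 2017-01-01 cutoff and the EmID=1 dates come from the question; the extra EmID=2 row is mine):
import pandas as pd

dummy = pd.DataFrame({
    'EmID': [1, 1, 1, 1, 2],
    'EffectiveDate': pd.to_datetime(
        ['2016-01-01', '2016-10-01', '2016-12-01', '2017-12-02', '2017-06-01']),
})

cutoff = '2017-01-01'
before = dummy.loc[dummy.EffectiveDate < cutoff]

# variant 1: per-EmID max of the pre-cutoff dates, mapped back onto every row
s = before.groupby('EmID').EffectiveDate.max()
dummy['max_prestart'] = dummy.EmID.map(s)

# variant 2: transform keeps the filtered rows' index, so rows with no
# pre-cutoff date come back NaT and are filled with their own EffectiveDate
dummy['max_prestart_2'] = before.groupby('EmID').EffectiveDate.transform('max')
dummy['max_prestart_2'] = dummy['max_prestart_2'].fillna(dummy.EffectiveDate)
print(dummy)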
