Pandas to calculate pct_change on specified freq option - python

I would like to calculate the 1-year, 2-year, and 3-year growth rates of a weekly/daily DataFrame:
start = '20100101'
end = '20201117'
df_ts = pd.DataFrame(index=pd.bdate_range(start=start, end=end, freq='D'))
df_ts['valeur1'] = range(1, df_ts.shape[0]+1)
df_ts['gr'] = 100*df_ts.pct_change(periods=1, freq='Y')
df_ts
I thought pct_change(periods=n, freq='Y') was the right way to do it, but I get an erroneous result even with this simple data.
I need to emphasise that my data is weekly/daily and I perform other operations on it, so I need to put this inside apply(lambda x: x.pct_change(periods=n, freq='Y')).
Any suggestions on how to do it simply?

You are assigning a whole DataFrame to a column because you called .pct_change on the DataFrame, not on the column. Try this:
df_ts['gr'] = df_ts['valeur1'].pct_change(periods=1, freq='Y')
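If freq='Y' does not behave as hoped on a daily index, one alternative sketch (not from this answer; it reuses the question's df_ts and assumes a sorted DatetimeIndex) looks up the value n years earlier with a nearest-date reindex:
import pandas as pd

start, end = '20100101', '20201117'
df_ts = pd.DataFrame(index=pd.bdate_range(start=start, end=end, freq='D'))
df_ts['valeur1'] = range(1, df_ts.shape[0] + 1)

for n in (1, 2, 3):
    # Value closest to exactly n years before each date
    past = df_ts['valeur1'].reindex(df_ts.index - pd.DateOffset(years=n), method='nearest')
    df_ts[f'gr_{n}y'] = 100 * (df_ts['valeur1'].values / past.values - 1)
    # The first n years have no true reference date, so blank them out
    df_ts.loc[df_ts.index < df_ts.index[0] + pd.DateOffset(years=n), f'gr_{n}y'] = float('nan')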

Related

how to get result of expanding/resample function in original dataframe using python

I have a dataframe with 1-minute timestamps of open, high, low, close, volume for a token.
Using the expanding or resample function, one can get a new dataframe based on the time interval; in my case it's a 1-day interval.
I am looking to get the above output in the original dataframe. Please assist with the same.
original dataframe:
desired dataframe:
Here "date_1d" is the time interval for my use case. i used expanding function but as the value changes in "date_1d" column, expanding function works on the whole dataframe
df["high_1d"] = df["high"].expanding().max()
df["low_1d"] = df["low"].expanding().min()
df["volume_1d"] = df["volume"].expanding().min()
Then the next challenge was how to find Open and Close based on the "date_1d" column.
Please assist, or ask more questions if my desired output is not clear.
FYI: the data is huge, 5 years of 1-minute data for 100 tokens.
Thanks in advance,
Sukhwant
I'm not sure if I understand it right, but it looks to me like you want to groupby each day and calculate first/last/min/max for them.
Is the column date_1d already there?
If not:
df["date_1d"] = df["date"].dt.strftime('%Y%m%d')
For the calculations:
df["open_1d"] = df.groupby("date_1d")['open'].transform('first')
df["high_1d"] = df.groupby("date_1d")['high'].transform('max')
df["low_1d"] = df.groupby("date_1d")['low'].transform('min')
df["close_1d"] = df.groupby("date_1d")['close'].transform('last')
EDIT:
Have a look at whether this works in your code as you expect (until we have some of your data I can only guess, sorry :D):
df['high_1d'] = df.groupby('date_1d')['high'].expanding().max().values
It groups the data per "date_1d", but within each group it only considers the current row and the rows above it.
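For illustration, here is a minimal self-contained sketch (synthetic 1-minute bars; column names guessed from the question) of how transform and the per-group expanding call combine:
import numpy as np
import pandas as pd

# Two calendar days of synthetic 1-minute OHLCV bars (values invented)
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-02", periods=2 * 1440, freq="1min")
df = pd.DataFrame({"open": 100 + rng.random(len(idx)),
                   "high": 101 + rng.random(len(idx)),
                   "low": 99 + rng.random(len(idx)),
                   "close": 100 + rng.random(len(idx)),
                   "volume": rng.integers(1, 1000, len(idx))},
                  index=idx)
df["date_1d"] = df.index.strftime("%Y%m%d")

# Per-day constants via transform ...
df["open_1d"] = df.groupby("date_1d")["open"].transform("first")
df["close_1d"] = df.groupby("date_1d")["close"].transform("last")
# ... and running intraday extremes via expanding, restarted on each new date_1d
df["high_1d"] = df.groupby("date_1d")["high"].expanding().max().values
df["low_1d"] = df.groupby("date_1d")["low"].expanding().min().values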
EDIT: Found a neat solution using the transform method. It erases the need for a "Day" column, as the df.groupby is done using the index.date attribute.
import datetime
import pandas as pd
import yfinance as yf

df = yf.download("AAPL", interval="1m",
                 start=datetime.date.today() - datetime.timedelta(days=6))

# Per-day open/close via transform, running high/low/volume via expanding,
# all grouped directly on the calendar date of the index
df['Open_1d'] = df["Open"].groupby(df.index.date).transform('first')
df['Close_1d'] = df["Close"].groupby(df.index.date).transform('last')
df['High_1d'] = df['High'].groupby(df.index.date).expanding().max().droplevel(level=0)
df['Low_1d'] = df['Low'].groupby(df.index.date).expanding().min().droplevel(level=0)
df['Volume_1d'] = df['Volume'].groupby(df.index.date).expanding().sum().droplevel(level=0)
df
Happy Coding!

Efficient way of vertically expanding a Dataframe by row and keeping the same values?

I am doing this educational challenge on kaggle https://www.kaggle.com/c/competitive-data-science-predict-future-sales
The training set is a file of daily sales numbers of some products, and the test set we need to predict is the sales of similar items for the month of November.
Now I would like to use my model to make daily predictions, and thus expand the test data set by 30 rows for each row.
I have the following code:
for row in test.itertuples():
    df = pd.DataFrame(index=nov15, columns=test.columns)
    df['shop_id'] = row.shop_id
    df['item_category_id'] = row.item_category_id
    df['item_price'] = row.item_price
    df['item_id'] = row.item_id
    df = df.reset_index()
    df.columns = ['date', 'item_id', 'shop_id', 'item_category_id', 'item_price']
    df = df[train.columns]
    tt = pd.concat([tt, df])
nov15 is a pandas date range from 1 Nov 2015 to 30 Nov 2015.
tt is just an empty dataframe that I fill by expanding it by 30 rows (Nov 1 to 30) for every row in the test set.
test is the original dataframe I am copying the rows from.
It runs, but it takes hours to complete.
Knowing pandas and learning from previous experiences, there is probably an efficient way to do this.
Thank you for your help!
So I had found a "more" efficient way, and then someone over at Reddit's r/learnpython told me about the correct and most efficient way.
The above dilemma is easily solved by the pandas explode function.
And these two lines do what I did above, but within seconds:
test['date'] = [nov15 for _ in range(len(test))]
test = test.explode('date')
Now, my "more efficient" way, or second solution, which is in no way close to equivalent or as good, was to simply make 30 copies of the dataframe with a column 'date' added.
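For reference, a tiny self-contained version of the explode approach (shop/item values invented, same two key lines as above):
import pandas as pd

# Invented two-row stand-in for the test set
nov15 = pd.date_range("2015-11-01", "2015-11-30")  # 30 daily timestamps
test = pd.DataFrame({"item_id": [11, 12],
                     "shop_id": [1, 2],
                     "item_category_id": [3, 4],
                     "item_price": [9.99, 4.50]})

# One list of 30 dates per row, then explode to 30 rows per original row
test["date"] = [nov15 for _ in range(len(test))]
tt = test.explode("date").reset_index(drop=True)
print(tt.shape)  # (60, 5)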

Pandas: Calculate the percentage between two rows and add the value as a column

I have a dataset structured like this:
"Date","Time","Open","High","Low","Close","Volume"
This time series represents the values of a generic stock market.
I want to calculate the difference in percentage between two rows of the column "Close" (in fact, I want to know how much the value of the stock increased or decreased; each row represents a day).
I've done this with a for loop (which is terrible when using pandas on a big data problem), and it creates the right results, but in a different DataFrame:
rows_number = df_stock.shape[0]

# The first row will be 1 because the change is calculated as a percentage;
# if there is no yesterday, the value must be 1
percentage_df = percentage_df.append({'Date': df_stock.iloc[0]['Date'], 'Percentage': 1}, ignore_index=True)

# For each day, calculate the market trend in percentage
for index in range(1, rows_number):
    # n_yesterday : 100 = (n_today - n_yesterday) : x
    n_today = df_stock.iloc[index]['Close']
    n_yesterday = df_stock.iloc[index - 1]['Close']
    difference = n_today - n_yesterday
    percentage = (100 * difference) / n_yesterday
    percentage_df = percentage_df.append({'Date': df_stock.iloc[index]['Date'], 'Percentage': percentage}, ignore_index=True)
How could I refactor this taking advantage of the DataFrame API, thus removing the for loop and creating the new column in place?
df['Change'] = df['Close'].pct_change()
or if you want to calculate the change in reverse order:
df['Change'] = df['Close'].pct_change(-1)
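If the result should be on the percent scale (and use the question's placeholder of 1 for the first day), a small self-contained sketch with invented values might look like:
import pandas as pd

# Toy data with the question's column names (values invented)
df = pd.DataFrame({"Date": ["2019-07-14", "2019-07-15", "2019-07-16"],
                   "Close": [100.0, 102.0, 99.96]})

# Day-over-day change in percent; the first row has no "yesterday",
# so it is filled with 1 as in the question's loop
df["Percentage"] = (100 * df["Close"].pct_change()).fillna(1)
# Percentage is roughly [1.0, 2.0, -2.0]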
I would suggest first making the Date column a DateTime index. For this you can use:
df_stock = df_stock.set_index(['Date'])
df_stock.index = pd.to_datetime(df_stock.index, dayfirst=True)
Then simply access any row and column by using datetime indexing and do whatever operations you want, for example to calculate the difference in percentage between two rows of the column "Close":
df_stock['percentage'] = ((df_stock['15-07-2019']['Close'] - df_stock['14-07-2019']['Close'])/df_stock['14-07-2019']['Close']) * 100
You can also use a for loop to do the operations for each date or row:
for Dt in df_stock.index:
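For reference, a minimal sketch of the date-indexed lookup (toy values; .loc is used here for the row access rather than the chained indexing shown above):
import pandas as pd

df_stock = pd.DataFrame({"Date": ["14-07-2019", "15-07-2019"],
                         "Close": [100.0, 102.0]})
df_stock = df_stock.set_index(["Date"])
df_stock.index = pd.to_datetime(df_stock.index, dayfirst=True)

# Percentage change between two specific dates via label indexing
pct = ((df_stock.loc["2019-07-15", "Close"] - df_stock.loc["2019-07-14", "Close"])
       / df_stock.loc["2019-07-14", "Close"]) * 100
print(pct)  # 2.0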
Using diff
(-df['Close'].diff())/df['Close'].shift()

Python Pandas: fill a column using values from rows at an earlier timestamps

I have a dataframe df where one column is timestamp and one is A. Column A contains decimals.
I would like to add a new column B and fill it with the current value of A divided by the value of A one minute earlier. That is:
df['B'] = df['A'][current] / df['A'][current - 1 min]
NOTE: The data does not come in exactly every 1 minute so "the row one minute earlier" means the row whose timestamp is the closest to (current - 1 minute).
Here is how I do it:
First, I use the timestamp as the index in order to use get_loc, and I create a new dataframe new_df starting from 1 minute after df. This way I'm sure I have the data when I go look 1 minute earlier, even within the first minute of data.
new_df = df.loc[df['timestamp'] > df.timestamp[0] + delta]  # delta = 1 min timedelta
values = []
for index, row in new_df.iterrows():
    v = row.A / df.iloc[df.index.get_loc(row.timestamp - delta, method='nearest')]['A']
    values.append(v)
v_ser = pd.Series(values)
new_df['B'] = v_ser.values
I'm afraid this is not that great. It takes a long time for large dataframes. Also, I am not 100% sure the above is completely correct. Sometimes I get this message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
What is the best / most efficient way to do the task above? Thank you.
PS. If someone can think of a better title please let me know. It took me longer to write the title than the post and I still don't like it.
You could try to use .asof() if the DataFrame has been indexed correctly by the timestamps (if not, use .set_index() first).
Simple example here
import pandas as pd
import numpy as np

n_vals = 50

# Create a DataFrame with random values and 'unusual' times
df = pd.DataFrame(data=np.random.randint(low=1, high=6, size=n_vals),
                  index=pd.date_range(start=pd.Timestamp.now(),
                                      freq='23s', periods=n_vals),
                  columns=['value'])

# Demonstrate how to use .asof() to get the value that was the 'state' at
# the time 1 min before the index. Note the .values call
df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1m')).values

# Note that there will be some NaNs to deal with; consider .fillna()
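To finish the question's task with this example's column names (a continuation of the block above, so the names are assumptions), the ratio is then just an element-wise division:
# B = current value / value from ~1 minute earlier (NaN where no earlier value exists)
df['B'] = df['value'] / df['value_one_min_ago']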

Creating time series DataFrame from event data

I have a dataset of locations of stores with dates of events (the date all stock was sold from that store) and quantities of the sold items, such as the following:
import numpy as np, pandas as pd
# Dates
start = pd.Timestamp("2014-02-26")
end = pd.Timestamp("2014-09-24")
# Generate some data
N = 1000
quantA = np.random.randint(10, 500, N)
quantB = np.random.randint(50, 250, N)
sell = np.random.randint(start.value, end.value, N)
sell = pd.to_datetime(np.array(sell, dtype="datetime64[ns]"))
df = pd.DataFrame({"sell_date": sell, "quantityA":quantA, "quantityB":quantB})
df.index = df.sell_date
I would like to create a new time series dataframe that has weekly summaries (or daily summaries, or summaries based on a custom date_range object) of these quantities A and B.
I can generate week number and aggregate sales based on those, like so...
df['week'] = df.sell_date.dt.week
df.pivot_table(values = ['quantityA', 'quantityB'], index = 'week', aggfunc = [np.sum, len])
But I don't see how to do the following:
expand this out to a full time series (based on a date_range object, such as period_range = pd.date_range(start = start, end = end, freq='7D')),
include the original date (as a 'week starting' variable) instead of the integer week number, or
change the date variable to be the index of this new dataframe.
I'm not sure if this is what you want, but you can try:
df.set_index('sell_date', inplace=True)
resampled = df.resample('7D', [sum, len])
The resulting index might not be exactly what you want, as it starts at the earliest datetime correct to the nanosecond. You could replace it with datetimes that have 00:00:00 as the time by doing:
resampled.index = pd.to_datetime(resampled.index.date)
EDIT:
You can actually just do
resampled = df.resample('W', [sum, len])
And the resulting index is exactly what you want. Interestingly, passing 'D' also gives the index you would expect, but passing a multiple like '2D' results in the 'ugly' index, that is, one starting at the earliest timestamp correct to the nanosecond and increasing in multiples of exactly 2 days. I guess the lesson is to stick to single units like 'D', 'W', 'M' where possible.
EDIT:
The API for resampling changed at some point such that the above no longer works. Instead one can do:
resampled = df.resample('W').agg([sum, len])
.resample now returns a Resampler object which exposes methods, much like the groupby API.
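Reusing the df generated in the question, a short sketch of the current API (the aggregation names and the label option are choices made here, not part of the original answer):
# Weekly sums and event counts with the current resample/agg API,
# keeping only the numeric columns
weekly = (df[["quantityA", "quantityB"]]
          .resample("W")
          .agg(["sum", "count"]))

# If a "week starting" label is preferred over the default week-ending date,
# label each bin by its left edge instead
weekly_start = (df[["quantityA", "quantityB"]]
                .resample("W", label="left")
                .agg(["sum", "count"]))
weekly_start.index.name = "week_starting"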
