How to reference row below in Python Pandas Dataframe? - python

I have a function that gets the stock price (adjusted closing price) on a specific date of the format DD-MM-YYYY. I have a Pandas Dataframe that looks like the following, with a column for date, as well as the stock price calculated using said function.
Date Stock price Percent change
0 02-07-2022 22.09
1 06-04-2022 18.22
2 01-01-2022 16.50
3 30-09-2021 18.15
4 03-07-2021 17.96
I need to calculate the percent change, which is calculated by taking (new/old - 1)*100, so in the top cell it would say (22.09/18.22 - 1)*100 = 21.24039517 because the stock increased 21.2% between 06-04-2022 and 02-07-2022.
So I need to "reference" the row below when applying a function, meanwhile I still reference the current row, because I need both to calculate change. For the bottom one, it can just be NaN or similar. Any suggestions?

I would first sort on date (after making sure that column is an actual datetime, e.g. pd.to_datetime(df['Date'], format='%d-%m-%Y')):
df = df.sort_values(by='Date', ascending=True)
And then calculate the percentage change and fill the NaN with 0, or with something else if you prefer. Note that pct_change returns a fraction (0.2124), so multiply by 100 to match your expected 21.24:
df["Percent change"] = (df["Stock price"].pct_change(periods=1) * 100).fillna(0)
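If you would rather keep the original newest-first order, you can reference the row below directly with shift(-1) instead of sorting; a minimal sketch using the question's numbers:

```python
import pandas as pd

# the question's data, newest date first
df = pd.DataFrame({
    "Date": ["02-07-2022", "06-04-2022", "01-01-2022", "30-09-2021", "03-07-2021"],
    "Stock price": [22.09, 18.22, 16.50, 18.15, 17.96],
})

# shift(-1) aligns each row with the row below it, so no re-sorting is needed;
# the bottom row has nothing below it and becomes NaN, as requested
df["Percent change"] = (df["Stock price"] / df["Stock price"].shift(-1) - 1) * 100
```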

Related

How to search for a specific date within concatenated DataFrame TimeSeries. Same Date would repeat several times in a merged df

I downloaded historical price data for ^GSPC Share Market Index (S&P500), and several other Global Indices. Date is set as index.
Selecting values in rows when date is set to index works as expected with .loc.
# S&P500 DataFrame = spx_df
spx_df.loc['2010-01-04']
Open 1.116560e+03
High 1.133870e+03
Low 1.116560e+03
Close 1.132990e+03
Volume 3.991400e+09
Dividends 0.000000e+00
Stock Splits 0.000000e+00
Name: 2010-01-04 00:00:00-05:00, dtype: float64
I then concatenated several Stock Market Global Indices into a single DataFrame for further use. In effect, any date in range will be included five times when historical data for five Stock Indices are linked in a Time Series.
markets = pd.concat(ticker_list, axis = 0)
I want to reference a single date in the concatenated df and assign it to a variable. I would prefer that the variable not be a datetime object, because I would like to access it with .loc as part of a def function. How does concatenation affect accessing rows via the date index when the same date repeats several times in the linked TimeSeries?
This is what I attempted so far:
# markets = concatenated DataFrame
Reference_date = markets.loc['2010-01-04']
# KeyError: '2010-01-04'
Reference_date = markets.loc[markets.Date == '2010-01-04']
# This doesn't work because Date is not an attribute of the DataFrame
Since you have set date as index you should be able to do:
Reference_date = markets.loc[markets.index == '2010-01-04']
To access a specific date in the concatenated DataFrame, you can use boolean indexing instead of .loc. This will return a DataFrame that contains all rows where the date equals the reference date:
reference_date = markets[markets.index == '2010-01-04']
You may also want to use the query() method to search for specific data:
reference_date = markets.query('index == "2010-01-04"')
Keep in mind that the resulting variable reference_date is still a DataFrame and contains all rows that match the reference date across all the concatenated DataFrames. If you want to extract only specific columns, you can use the column name like this:
reference_date_Open = markets.query('index == "2010-01-04"')["Open"]
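If you control the concat step, passing keys gives each source its own outer index level, so the repeated dates stay distinguishable. A sketch with made-up frames and prices (the "SPX"/"DAX" labels and numbers are illustrative only):

```python
import pandas as pd

# hypothetical per-index frames; the values are made up for illustration
spx = pd.DataFrame({"Open": [1116.56]}, index=pd.to_datetime(["2010-01-04"]))
dax = pd.DataFrame({"Open": [5975.52]}, index=pd.to_datetime(["2010-01-04"]))

# keys= turns the result into a MultiIndex of (market, date)
markets = pd.concat([spx, dax], keys=["SPX", "DAX"])

# all markets on one date: cross-section on the date level
on_date = markets.xs(pd.Timestamp("2010-01-04"), level=1)

# a single market on that date
spx_row = markets.loc[("SPX", pd.Timestamp("2010-01-04"))]
```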

How to select rows of a dataframe using MultiIndex in Pandas

I have these two dataframes I need to relate.
The first one, HOLIDAYS, gives me local holiday dates and the codes of the stores in which they're celebrated:
Holiday date    Store code
01/02           18005
01/02           18032
...             ...
31/03           18043
The second one, BALANCE, shows the balance of stores on certain dates, with date and store number as the index:
                 balance
01/02   18001    $35,00
01/02   18002    $38,00
...     ...      ...
31/03   18099    $20,45
What I need to do is to create a column in BALANCE named Holiday with a boolean value showing if a certain row is showing a balance obtained during a holiday or not.
I tried to create the column 'Holiday' with an initial value of False and then set it to True for every (date, store) pair of HOLIDAYS that appears in the index of BALANCE, but I'm getting a ValueError (possibly because a dataframe cannot be passed as the index of another). I also tried converting HOLIDAYS to a MultiIndex, but again it's not working.
BALANCE['Holiday'] = False
H = pd.MultiIndex.from_frame(HOLIDAYS)
BALANCE.loc[H, 'Holiday'] = True
I'm pretty sure this should not be difficult, but I'm out of ideas now. Is there any way I could use the first dataframe as a MultiIndex into the second?
Your example doesn't have any rows which match, but this should work:
HOLIDAYS['is_holiday'] = True
res = pd.merge(BALANCE,
               HOLIDAYS,
               how='left',
               left_index=True,
               right_on=['Holiday_date', 'Store_code'])
res['is_holiday'] = res['is_holiday'].fillna(False)
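Alternatively, the asker's original MultiIndex idea does work with Index.isin, which avoids the merge and keeps BALANCE's index intact. A small sketch with toy data (index level names are assumptions for illustration):

```python
import pandas as pd

HOLIDAYS = pd.DataFrame({"Holiday date": ["01/02", "01/02"],
                         "Store code": [18005, 18032]})

BALANCE = pd.DataFrame(
    {"balance": ["$35,00", "$38,00"]},
    index=pd.MultiIndex.from_tuples([("01/02", 18005), ("01/02", 18001)],
                                    names=["date", "store"]),
)

# True wherever the (date, store) pair of BALANCE appears in HOLIDAYS
BALANCE["Holiday"] = BALANCE.index.isin(pd.MultiIndex.from_frame(HOLIDAYS))
```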

How to update multiple dataframe column cells using function return values

I have a follow up question from my previous question here How to update dataframe cells using function return values
I have the following dataframe called df1,
country ticker price exchange dateupdated
0 US MSFT 105.32 Nasdaq 05-May-2019
1 US AAPL
2 GERMANY NSU.DE 10.42 XETRA 06-May-2019
3 SG D05.SI
4 AUS WOW.AX
I have a function called data_get that looks like this. It takes in a ticker from the ticker column, for example 'MSFT', and returns a list.
def data_get(ticker):
    listofdata = somefunction
    return listofdata

listofdata = data_get('AAPL')
listofdata is a list of strings containing price, exchange and dateupdated:
listofdata = ['150.22', 'Nasdaq', '05-May-2020']
The function has to go online to look up the data so it takes a few seconds to run each time.
I want to only use this function on the cells which don't have a price in them, (price cells are empty).
So the function would only be used on rows 1, 3 & 4 in this dataframe and update the price, exchange and dateupdated values in them to look something like this.
country ticker price exchange dateupdated
0 US MSFT 105.32 Nasdaq 05-May-2019
1 US AAPL 150.22 Nasdaq 05-May-2020
2 GERMANY NSU.DE 10.42 XETRA 06-May-2019
3 SG D05.SI 23.44 SGX 04-May-2020
4 AUS WOW.AX 12.33 ASX 04-May-2020
To summarize, how do I:
apply the data_get function to only the rows in which price is empty (1, 3, 4),
use the ticker column string (AAPL, D05.SI, WOW.AX) as an input for the data_get function,
use the list output from the data_get function and update the price, exchange and dateupdated columns?
It would also be a bonus if I could convert the strings to a float for the price column and a date for the dateupdated column.
Do an apply where the values are NaN (note the column is price, not prices):
df.loc[df.price.isnull(), ['price', 'exchange', 'dateupdated']] = \
    df.loc[df.price.isnull(), 'ticker'].apply(data_get)
Return type: a list added directly to the respective columns.
I did it this way, keeping in mind you are returning a list, so the data points are updated individually:
df.loc[df.Price.isnull(), ['Price', 'Exchange', 'Update_date']] = pd.DataFrame(df.loc[df.Price.isnull(), 'Ticker'].apply(data_get).to_dict()).transpose().rename(columns={0: 'Price', 1: 'Exchange', 2: 'Update_date'})
This single line would directly take the dataframe and based on your function's return type, add the list elements in order to the corresponding columns in the dataframe
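Both answers can be combined into one pattern: build the mask once, turn the looked-up lists into a small DataFrame aligned on the masked index, and cast the columns afterwards (which also covers the bonus question). data_get is stubbed out here with a hypothetical lookup table, since the real one fetches data online:

```python
import numpy as np
import pandas as pd

def data_get(ticker):
    # hypothetical stand-in for the real web lookup in the question
    fake = {"AAPL": ["150.22", "Nasdaq", "05-May-2020"],
            "D05.SI": ["23.44", "SGX", "04-May-2020"]}
    return fake[ticker]

df = pd.DataFrame({
    "country": ["US", "US", "SG"],
    "ticker": ["MSFT", "AAPL", "D05.SI"],
    "price": [105.32, np.nan, np.nan],
    "exchange": ["Nasdaq", np.nan, np.nan],
    "dateupdated": ["05-May-2019", np.nan, np.nan],
})

# only look up the rows whose price is missing
mask = df["price"].isnull()
updates = pd.DataFrame(df.loc[mask, "ticker"].apply(data_get).tolist(),
                       index=df.index[mask],
                       columns=["price", "exchange", "dateupdated"])
updates["price"] = updates["price"].astype(float)  # bonus: prices as floats

# assignment aligns on both index and columns
df.loc[mask, ["price", "exchange", "dateupdated"]] = updates

# bonus: parse the whole date column once everything is filled in
df["dateupdated"] = pd.to_datetime(df["dateupdated"], format="%d-%b-%Y")
```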

Groupby with Apply Method in Pandas : Percentage Sum of Grouped Values

I am trying to develop a program to convert daily data into monthly or yearly data and so on.
I have a DataFrame with datetime index and price change %:
% Percentage
Date
2015-06-02 0.78
2015-06-10 0.32
2015-06-11 0.34
2015-06-12 -0.06
2015-06-15 -0.41
...
I had success grouping by some frequency. Then I tested:
df.groupby('Date').sum()
df.groupby('Date').cumsum()
If that were the case it would work fine, but the problem is that I can't sum it the percent way, (1+x0)*(1+x1)... - 1. Then I tried:
def myfunc(values):
    p = 0
    for val in values:
        p = (1 + p) * (1 + val) - 1
    return p

df.groupby('Date').apply(myfunc)
I can't understand how apply() works. It seems to apply my function to all of the data and not just to the grouped items.
Your apply is applying to all rows individually because you're grouping by the date column. Your date column looks to have unique values for each row, so each group has only one row in it. You need to use a Grouper to group by month, then use cumprod and get the last value for each group:
# make sure Date is a datetime
df["Date"] = pd.to_datetime(df["Date"])
# add one to percentages
df["% Percentage"] += 1
# use cumprod on each month group, take the last value, and subtract 1
df.groupby(pd.Grouper(key="Date", freq="M"))["% Percentage"].apply(lambda g: g.cumprod().iloc[-1] - 1)
Note, though, that this applies the percentage growth as if the steps between your rows were the same, but it looks like sometimes it's 8 days and sometimes it's 1 day. You may need to do some clean-up depending on the result you want.
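Since only the last cumulative value of each group is kept, .prod() - 1 is an equivalent, slightly simpler reduction. Grouping on Date.dt.to_period("M") also sidesteps the "M" frequency alias, which newer pandas versions deprecate for Grouper. A sketch with a few rows shaped like the question's data (the July value is made up to show a second group):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2015-06-02", "2015-06-10", "2015-07-01"],
    "% Percentage": [0.78, 0.32, 0.10],
})
df["Date"] = pd.to_datetime(df["Date"])

# compound the per-row changes within each calendar month:
# (1 + x0) * (1 + x1) * ... - 1
monthly = (df.groupby(df["Date"].dt.to_period("M"))["% Percentage"]
             .apply(lambda s: (1 + s).prod() - 1))
```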

Pandas: Getting a rolling sum while grouping by a column

I have a pandas dataframe that looks like
Name Date Value
Sarah 11-01-2015 3
Sarah 11-02-2015 2
Sarah 11-03-2015 27
Bill 11-01-2015 42
Bill 11-02-2015 5
Bill 11-03-2015 15
.... (a couple hundred rows)
How do I get a 30 day (or x day) rolling sum of these values broken out by whoever is in the 'Name' column? The ideal output would have the same columns as the current dataframe, but instead of having the values for each row be what that person had as a value for that day, it would be the cumulative sum of what their values over the past 30 days.
I know I can do
result = pd.rolling_sum(df, 30)
to get the rolling sum overall. But how do I return a dataframe with that rolling sum grouped by the 'Name' column?
Figured it out using the grigri group_resample function.
df = group_resample(df,date_column='Date',groupby=group_by,value_column='Value',how='sum',freq='d')
df = df.unstack(group_by).fillna(0)
result = pd.rolling_mean(df,30)
Note that if you don't need a precise temporal window, or if your dataset has 1 line per [day , user] (which seems to be your case), then the standard groupby of pandas is perfectly suited. See this very similar question
Otherwise, something like:
df.groupby('Name').rolling('30D', on='Date')['Value'].sum()
should work, as long as 'Date' is an actual datetime column and sorted within each group.
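A self-contained sketch of that groupby-rolling approach, using the question's numbers (with the Date strings read as MM-DD-YYYY):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sarah", "Sarah", "Sarah", "Bill", "Bill", "Bill"],
    "Date": ["11-01-2015", "11-02-2015", "11-03-2015",
             "11-01-2015", "11-02-2015", "11-03-2015"],
    "Value": [3, 2, 27, 42, 5, 15],
})

# a time-based rolling window needs a real, sorted datetime column
df["Date"] = pd.to_datetime(df["Date"], format="%m-%d-%Y")
df = df.sort_values(["Name", "Date"])

# per-person sum of Value over the trailing 30 days
rolling_30d = df.groupby("Name").rolling("30D", on="Date")["Value"].sum()
```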
