Difference of 2 columns in pandas dataframe with some given conditions - python

I have a sheet like this. I need to calculate the absolute value of "CURRENT HIGH" - "PREVIOUS DAY CLOSE PRICE" for each particular "INSTRUMENT" and "SYMBOL".
I used the .shift(1) function of the pandas dataframe to create a lagged close column, and then subtracted that column from the current HIGH. However, this also subtracts across two different "INSTRUMENT" and "SYMBOL" values. When a new SYMBOL or INSTRUMENT appears, I want the first row to be NULL instead of the difference between the current HIGH and the lagged close.
What should I do?

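I believe you need a grouped shift, assuming the rows in each group are sorted by date and the days are consecutive. Add .abs() to get the absolute value:
df['new'] = df['HIGH'].sub(df.groupby(['INSTRUMENT','SYMBOL'])['CLOSE'].shift()).abs()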
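As a minimal sketch of the grouped shift, here it is on made-up data. The rows and values are hypothetical; only the column names come from the answer. The .abs() call is there because the question asks for the absolute difference.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the answer above.
df = pd.DataFrame({
    "INSTRUMENT": ["FUT", "FUT", "FUT", "OPT", "OPT"],
    "SYMBOL": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "HIGH": [105.0, 110.0, 108.0, 52.0, 55.0],
    "CLOSE": [100.0, 104.0, 107.0, 50.0, 51.0],
})

# Lag CLOSE within each (INSTRUMENT, SYMBOL) group, so the first row of
# every group has no previous close and stays NaN.
prev_close = df.groupby(["INSTRUMENT", "SYMBOL"])["CLOSE"].shift()

# Absolute difference between the current HIGH and the previous close.
df["new"] = df["HIGH"].sub(prev_close).abs()
print(df)
```

The first row of each group comes out as NaN, which is the NULL behaviour asked for. This assumes the rows are already sorted by date within each group.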

Related

Python dataframe: Standard deviation of last one year of data

I have a dataframe df with 10 years of daily stock market data, with columns Date, Open, and Close.
I want to calculate the daily standard deviation of the close price. The calculation has two steps:
Step 1: Calculate the daily interday change of the Close.
Step 2: Calculate the standard deviation of the daily interday change (from Step 1) over the last 1 year of data.
So far, I have figured out Step 1 with the code below. The column Interday_Close_change holds the difference between each row and the value one day earlier.
df = pd.DataFrame(data, columns=columns)
df['Close_float'] = df['Close'].astype(float)
df['Interday_Close_change'] = df['Close_float'].diff()
df.fillna('', inplace=True)
Questions:
(a). How do I obtain a column Daily_SD that holds the standard deviation of the last 252 days (1 year of trading days)? In Excel, the formula STDEV.S() does this.
(b). Daily_SD should begin on the 252nd row of the data, since that is when 252 data points are available to calculate from. How do I achieve this?
It looks like you are trying to calculate a rolling standard deviation, with the rolling window consisting of the previous 252 rows.
pandas provides many methods on .rolling() windows, including one for standard deviation:
df['Daily_SD'] = df['Interday_Close_change'].rolling(252).std().shift()
If fewer than 252 rows are available from which to calculate the standard deviation, the result for that row will be a null value (NaN). Think about whether you really want to apply the .fillna('') method to fill null values, as you are doing. That will convert the entire column from a numeric (float) data type to the object data type.
Without the .shift() method, the current row's value is included in the calculation. The .shift() method shifts all rolling standard deviation values down by 1 row, so the current row's result is the standard deviation of the previous 252 rows, as you want.
With pandas version >= 1.2, you can use this instead:
df['Daily_SD'] = df['Interday_Close_change'].rolling(252, closed='left').std()
The closed='left' parameter excludes the last point in the window (the current row) from the calculation.
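To see that the two variants agree, here is a small sketch on synthetic prices. The random data and the names sd_shift and sd_left are made up for illustration. Note that .std() uses ddof=1 by default, so it is the sample standard deviation, the same convention as Excel's STDEV.S().

```python
import numpy as np
import pandas as pd

# Hypothetical data: 300 synthetic closing prices.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Close": 100 + rng.normal(size=300).cumsum()})
df["Interday_Close_change"] = df["Close"].diff()

# Variant 1: rolling std over 252 rows, then shifted down by one row.
sd_shift = df["Interday_Close_change"].rolling(252).std().shift()

# Variant 2: a window that excludes the current row (pandas >= 1.2).
sd_left = df["Interday_Close_change"].rolling(252, closed="left").std()

df["Daily_SD"] = sd_shift

# Both variants should give the same values, NaN positions included.
print(np.allclose(sd_shift, sd_left, equal_nan=True))
print(df["Daily_SD"].first_valid_index())
```

Because .diff() leaves the first row empty, the first non-null Daily_SD appears a little later than the 252nd row. Rows before that stay NaN, which is usually preferable to filling them with empty strings.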

How to compare elements of one dataframe to another?

I have a dataframe, called PORResult, of daily temperatures where rows are years and each column is a day (121 rows x 365 columns). I also have an array, called Percentile_90, of a threshold temperature for each day (length=365). For every day of every year in the PORResult dataframe, I want to find out whether the value for that day is higher than the value for that day in the Percentile_90 array. I want to store the results in a new dataframe, called Count (121 rows x 365 columns). To start, the Count dataframe is full of zeros, but if the daily value in PORResult is greater than the daily value in Percentile_90, I want to change the daily value in Count to 1.
This is what I'm starting with:
for i in range(len(PORResult)):
    if PORResult.loc[i] > Percentile_90[i]:
        CountResult[i] += 1
But when I try this I get KeyError:0. What else can I try?
(Edited:)
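Depending on your data structure, I think
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)
should do the trick. With years as rows and days as columns, the 365 thresholds have to be matched against the columns, hence axis=1; use axis=0 only if the days are the rows. Generally, the toolset provided in pandas is sufficient that for-looping over a dataframe is unnecessary (as well as remarkably inefficient).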
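Here is a small sketch of the vectorized comparison on a made-up 3 x 4 stand-in for the 121 x 365 data (the values and year labels are hypothetical). It assumes the layout described in the question, years as rows and days as columns, so the thresholds are matched against the columns with axis=1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: 3 years (rows) x 4 days (columns).
PORResult = pd.DataFrame(
    [[10, 20, 30, 40],
     [15, 18, 35, 38],
     [12, 25, 28, 45]],
    index=[1900, 1901, 1902],
)
Percentile_90 = np.array([11, 19, 29, 41])  # one threshold per day (column)

# Compare every row against the per-column thresholds, then map
# True/False to 1/0.
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)
print(CountResult)
```

If the index holds years, as in this sketch, that would also explain the KeyError: 0 from the loop, since .loc[0] looks for a row labelled 0 rather than the first row.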

pandas create a column to compare a value with one week ago

I have a pandas dataframe whose index is time with hourly precision. I want to create a new column that compares the value of the column "Sales number" at each hour with the value at the exact same time one week earlier.
I know that it can be written using the shift function:
df['compare'] = df['Sales'] - df['Sales'].shift(7*24)
But I wonder how I can take advantage of the datetime format of the index. Is there an alternative to shift(7*24) when the index is in datetime format?
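Since the index is datetime-like, try shifting by a time offset instead of a row count. With freq, shift moves the index labels forward by 7 days rather than moving the values, and the subtraction then aligns on the timestamps:
df['compare'] = df['Sales'] - df['Sales'].shift(7, freq='D')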
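A minimal sketch comparing the row-based and the time-based shift on made-up hourly data (the dates and values are hypothetical, and the index is assumed to be a DatetimeIndex without duplicate timestamps):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sales over two weeks.
idx = pd.date_range("2021-01-01", periods=14 * 24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"Sales": np.arange(len(idx), dtype=float)}, index=idx)

# Row-count shift: relies on exactly 168 rows separating one week.
by_rows = df["Sales"] - df["Sales"].shift(7 * 24)

# Time-based shift: moves the index forward by 7 days, and the
# subtraction then aligns the two series on their timestamps.
by_time = df["Sales"] - df["Sales"].shift(7, freq="D")

# Assignment aligns on df's index, dropping the extra future timestamps.
df["compare"] = by_time
print(df["compare"].dropna().head())
```

On a complete hourly index both give the same result. The time-based version keeps comparing like with like if some hours are missing, whereas shift(7*24) would then silently compare against the wrong timestamp.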

Create multiple columns within groupby for calculation?

I have a dataset like this:
SKU,Date,Inventory,Sales,Incoming
2010,2017-01-01 0:00:00,85,126,252
2010,2017-02-01 0:00:00,382,143,252
2010,2017-03-01 0:00:00,414,139,216
2010,2017-04-01 0:00:00,468,120,216
7770,2017-01-01 0:00:00,7,45,108
7770,2017-02-01 0:00:00,234,64,216
7770,2017-03-01 0:00:00,160,69,36
7770,2017-04-01 0:00:00,150,50,72
7870,2017-01-01 0:00:00,41,29,36
7870,2017-02-01 0:00:00,95,18,36
7870,2017-03-01 0:00:00,112,16,36
7870,2017-04-01 0:00:00,88,19,0
Inventory Quantity is the "actual" recorded quantity, which may differ from the hypothetical remaining quantity, which is what I am trying to calculate.
Sales Quantity actually extends much longer into the future. In those rows, the other two columns will have NA.
I want to create the following:
1 - Take only the first Inventory value of each SKU
2 - Use the first value to calculate the hypothetical remaining quantity by using a recursive formula [Earliest inventory] - [Sales for that month] - [Incoming qty for that month] (Note: Earliest inventory is a fixed quantity for each SKU). Store the output in a column called "End of Month Part 1".
3 - Create another column called "Buy Quantity" with the following criteria: If remaining quantity is less than 50, then create a new column that indicates the buy amount (let's say it's 30 for all 3 SKUs) (i.e. increase the quantity by 30). If the remaining quantity is more than 50, then the buy amount is zero.
4 - Create another column called "End of Month Part 2" that adds "End of Month Part 1" with "Buy Quantity"
I am able to obtain the first quantity of each SKU using the following code, and merge it into a column called first_qty into the dataset
first_qty_series = dataset.groupby(by=['SKU']).nth(0)['Inventory']
first_qty_series = pd.DataFrame(first_qty_series).reset_index().rename(columns={'Inventory': 'Earliest inventory'})
dataset = pd.merge(dataset, pd.DataFrame(first_qty_series), on='SKU' )
As for the remainder quantity, I thought of using cumsum() on the two columns dataset['Sales'] and dataset['Incoming'] but I think it won't work because the cumsum() will sum across ALL SKUs.
That's why I think I need to perform the calculation in groupby. But I don't know what else to do.
(Edit:) Expected output is:
Thank you guys!
Here is a way to do the 4 columns you want.
1 - Your method is good. Another option is to use loc and drop_duplicates to fill the first row of each 'SKU' with its value from 'Inventory', and then ffill to fill the following rows.
dataset.loc[dataset.drop_duplicates(['SKU']).index,'Earliest inventory'] = dataset['Inventory']
dataset['Earliest inventory'] = dataset['Earliest inventory'].ffill().astype(int)
2 - Indeed, you need cumsum and groupby to create the column 'End of Month Part 1', but not on the column 'Earliest inventory', since its value is the same on every row of a given 'SKU'. Note: based on your expected result (and the logic), I changed the - to + before the column 'Incoming'. If I misunderstood the problem, just change the sign.
dataset['End of Month Part 1'] = (dataset['Earliest inventory']
                                  - dataset.groupby('SKU')['Sales'].cumsum()
                                  + dataset.groupby('SKU')['Incoming'].cumsum())
3 - The column 'Buy Quantity' can be filled using loc again, selecting the rows where 'End of Month Part 1' is at or below 50 (change <= to < if a value of exactly 50 should not trigger a purchase), then using fillna with 0 for the rest:
dataset.loc[dataset['End of Month Part 1'] <= 50, 'Buy Quantity'] = 30
dataset['Buy Quantity'] = dataset['Buy Quantity'].fillna(0).astype(int)
4 - Finally, the last column is just the sum of the last two columns created:
dataset['End of Month Part 2'] = dataset['End of Month Part 1'] + dataset['Buy Quantity']
If I understood the 4 points correctly, you should get the dataset with the new columns.
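For reference, here are the four steps run end to end on the sample rows from the question. This is a sketch, not the only way to do it. For step 1 it swaps in groupby(...).transform('first') instead of the loc/ffill approach; that assumes the first Inventory value of each SKU is not missing, since 'first' skips NaN. The threshold of 50 and the buy amount of 30 are the figures from the question.

```python
import io
import pandas as pd

# Sample rows copied from the question.
csv = """SKU,Date,Inventory,Sales,Incoming
2010,2017-01-01 0:00:00,85,126,252
2010,2017-02-01 0:00:00,382,143,252
2010,2017-03-01 0:00:00,414,139,216
2010,2017-04-01 0:00:00,468,120,216
7770,2017-01-01 0:00:00,7,45,108
7770,2017-02-01 0:00:00,234,64,216
7770,2017-03-01 0:00:00,160,69,36
7770,2017-04-01 0:00:00,150,50,72
7870,2017-01-01 0:00:00,41,29,36
7870,2017-02-01 0:00:00,95,18,36
7870,2017-03-01 0:00:00,112,16,36
7870,2017-04-01 0:00:00,88,19,0
"""
dataset = pd.read_csv(io.StringIO(csv))

# Step 1: first recorded inventory of each SKU, repeated on all its rows.
dataset["Earliest inventory"] = dataset.groupby("SKU")["Inventory"].transform("first")

# Step 2: running balance per SKU (sales subtracted, incoming added).
dataset["End of Month Part 1"] = (
    dataset["Earliest inventory"]
    - dataset.groupby("SKU")["Sales"].cumsum()
    + dataset.groupby("SKU")["Incoming"].cumsum()
)

# Step 3: buy 30 units when the balance is at or below 50, otherwise 0.
dataset["Buy Quantity"] = (dataset["End of Month Part 1"] <= 50).astype(int) * 30

# Step 4: balance after the purchase.
dataset["End of Month Part 2"] = dataset["End of Month Part 1"] + dataset["Buy Quantity"]

print(dataset.drop(columns=["Date"]))
```

On this data only the first month of SKU 7870 falls below the threshold, so it is the only row with a non-zero 'Buy Quantity'. Note that the purchase is not carried into the following months' balances; that would need a genuinely recursive calculation.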

What is pandas syntax for lookup based on existing columns + row values?

I'm trying to recreate a bit of a convoluted scenario, but I will do my best to explain it:
Create a pandas df1 with two columns: 'Date' and 'Price' - done
I add two new columns: 'rollmax' and 'rollmin', where 'rollmax' is an 8-day rolling maximum and 'rollmin' is an 8-day rolling minimum. - done
Now I need to create another column 'rollmax_date' that gets populated through a lookup rule: for row n, go to the column 'Price', scan the values for the last 8 days and find the maximum, then get the value of the corresponding column 'Date' and put it in the column 'rollmax_date'.
The same logic applies to 'rollmin_date', but instead of the date of the rolling maximum, we look for the date of the rolling minimum.
Now I need to find the previous 8 days max and min for the same rolling window of 8 days that I have already found.
I did the first two and tried the third one, but I'm getting wrong results.
The code below gives me only dates where on the same row df["Price"] is the same as df['rollmax'], but it doesn't bring all the corresponding dates from 'Date' to 'rollmax_date'
df['rollmax_date'] = df.loc[(df["Price"] == df.rollmax), 'Date']
This is an image with steps for recreating the lookup
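One possible way to express the lookup rule, as a sketch on made-up prices (the data is hypothetical). It assumes one row per day, so "last 8 days" means the last 8 rows, and that ties should resolve to the earliest date in the window. The idea is to find the position of the maximum inside each window with np.argmax, convert it to an absolute row position, and read the Date there.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 12 daily prices.
df = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=12, freq="D"),
    "Price": [9, 3, 8, 6, 7, 2, 5, 4, 1, 6, 3, 2],
})
window = 8

df["rollmax"] = df["Price"].rolling(window).max()

# Position of the maximum inside each window (0 = oldest row of the window);
# NaN while the window is not yet full.
pos_in_window = df["Price"].rolling(window).apply(np.argmax, raw=True)

# Absolute row position = first row of the window + position inside it.
first_row_of_window = np.arange(len(df)) - (window - 1)
abs_pos = (pos_in_window + first_row_of_window).dropna().astype(int)

# Look up the Date at those positions; rows without a full window get NaT.
dates = df["Date"].to_numpy()
df["rollmax_date"] = pd.Series(dates[abs_pos.to_numpy()], index=abs_pos.index)
print(df)
```

Replacing np.argmax with np.argmin and max() with min() gives 'rollmin' and 'rollmin_date' the same way.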
