I used the values 360D and 360 (which I thought were equivalent) for the window parameter of the .rolling() method. However, they produced different graphs. Could you please explain the difference between those two values?
rolling_stats = data.Ozone.rolling(window='360D').agg(['mean', 'std'])
stats = data.join(rolling_stats)
stats.plot(subplots=True)
plt.show()
rolling_stats = data.Ozone.rolling(window=360).agg(['mean', 'std'])
stats = data.join(rolling_stats)
stats.plot(subplots=True)
plt.show()
The difference is that when you use the string '360D', pandas uses a time-based window of 360 calendar days, whereas when you pass the integer 360 it uses a fixed window of the last 360 rows (observations).
The other difference is the default for min_periods. With a time-based window ('360D') it defaults to 1, so a result is returned from the very first row; in fact, for the first row data.Ozone.rolling(...).agg() is simply equal to data.Ozone.iloc[0]. With an integer window (360), min_periods defaults to the window size, so rolling.agg() returns NaN for the first 359 rows. That is why there is a gap at the start of your second plot but none in the first.
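A minimal sketch of the two behaviours with hypothetical daily data (the index and values are dummies standing in for data.Ozone):
import pandas as pd
import numpy as np
# dummy daily series standing in for data.Ozone
idx = pd.date_range("2000-01-01", periods=500, freq="D")
ozone = pd.Series(np.random.rand(500), index=idx)
# offset window: 360 calendar days; min_periods defaults to 1,
# so the very first result is just the first observation itself
by_days = ozone.rolling(window='360D').mean()
# integer window: 360 rows; min_periods defaults to the window size,
# so the first 359 results are NaN
by_rows = ozone.rolling(window=360).mean()
print(by_days.isna().sum())  # 0
print(by_rows.isna().sum())  # 359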
I have a DataFrame tracking temperatures over time.
It looks like this:
For a few days there was a problem and the values show 0, so the plot looks like this:
I have replaced the 0 values with NaN and then used the interpolate method, but the result is not what I need; even with method='time' I get this:
So how can I use a customised interpolation, or something else, to correct this based on the previous behaviour?
Thank you
I would not interpolate. I would take N elements before and after the gap, compute the average temperature, and fill the gap with random values drawn from a normal distribution around that average (you can use the standard deviation too).
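A minimal sketch of that idea, assuming the bad readings have already been replaced with NaN and the column is called Temperature (the column name, the helper fill_gap_with_noise and n=48 are all hypothetical):
import numpy as np
import pandas as pd
def fill_gap_with_noise(series, n=48, seed=None):
    # fill NaN gaps with draws from a normal distribution fitted to the
    # n valid values before and after each gap
    rng = np.random.default_rng(seed)
    s = series.copy()
    is_nan = s.isna()
    # label consecutive NaN runs so each gap is handled separately
    gap_id = (is_nan != is_nan.shift()).cumsum()[is_nan]
    for _, gap in s[is_nan].groupby(gap_id):
        start, end = gap.index[0], gap.index[-1]
        neighbours = pd.concat([s.loc[:start].dropna().tail(n),
                                s.loc[end:].dropna().head(n)])
        s.loc[gap.index] = rng.normal(neighbours.mean(), neighbours.std(), size=len(gap))
    return s
# df['Temperature'] = fill_gap_with_noise(df['Temperature'], n=48)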
The variance between the frames indicates true variable values that are missing or not calculated within the set range.
I have a dataframe df with daily stock market data for 10 years, with columns Date, Open, Close.
I want to calculate the daily standard deviation of the close price. The procedure is:
Step 1: Calculate the daily interday change of the Close.
Step 2: Calculate the standard deviation of that daily interday change (from Step 1) over the last 1 year of data.
Presently, I have figured out Step 1 as per the code below. The column Interday_Close_change holds the difference between each row's Close and the value one day before.
df = pd.DataFrame(data, columns=columns)
df['Close_float'] = df['Close'].astype(float)
# difference between each row's Close and the previous day's value
df['Interday_Close_change'] = df['Close_float'].diff()
df.fillna('', inplace=True)
Questions:
(a) How do I obtain a column Daily_SD which holds the standard deviation over the last 252 days (1 year of trading days)? In Excel, the STDEV.S() formula does this.
(b) The Daily_SD should begin on the 252nd row of the data, since that is when there are 252 datapoints to calculate from. How do I achieve this?
It looks like you are trying to calculate a rolling standard deviation, with the rolling window consisting of previous 252 rows.
Pandas has many .rolling() methods, including one for standard deviation:
df['Daily_SD'] = df['Interday_Close_change'].rolling(252).std().shift()
If there are fewer than 252 rows available from which to calculate the standard deviation, the result for the row will be a null value (NaN). Think about whether you really want to apply the .fillna('') method to fill null values, as you are doing: that will convert the entire column from a numeric (float) data type to object data type.
Without the .shift() method, the current row's value will be included in calculations. The .shift() method will shift all rolling standard deviation values down by 1 row, so the current row's result will be the standard deviation of the previous 252 rows, as you want.
With pandas version >= 1.2 you can use this instead:
df['Daily_SD'] = df['Interday_Close_change'].rolling(252, closed='left').std()
The closed='left' parameter excludes the last point in the window (the current row) from the calculation.
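Putting this together, a short end-to-end sketch with dummy prices (the data here are made up just to show where the NaNs fall):
import pandas as pd
import numpy as np
# dummy price history standing in for the real 10 years of data
dates = pd.bdate_range("2010-01-01", periods=2520)
df = pd.DataFrame({'Date': dates, 'Close': 100 + np.random.randn(2520).cumsum()})
df['Interday_Close_change'] = df['Close'].diff()
# rolling std over the previous 252 rows, excluding the current row
df['Daily_SD'] = df['Interday_Close_change'].rolling(252).std().shift()
# the first rows stay NaN until a full 252-row window is available;
# leaving them as NaN keeps the column numeric (no fillna(''))
print(df['Daily_SD'].first_valid_index())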
I'm trying to calculate the annualized return of Amazon stock and can't figure out the main difference between the following approaches:
df = pdr.get_data_yahoo('amzn', datetime(2015, 1, 1), datetime(2019, 12, 31))['Adj Close']
1) df.pct_change().mean() * 252
Result = 0.400
2) df.resample('Y').last().pct_change().mean()
Result = 0.472
Why is there a difference of about 7%?
After reading the docs for the functions, I'd like to go through an example of resampling time series data for a better understanding.
With the resample method, the price column of the DataFrame is grouped by a certain time span; in this case 'Y' indicates resampling by year, and with last() we get the price value at the end of each year.
data.resample('Y').last()
Output: Step 1
Next, with pct_change() we calculate the percentage change between each row and the previous one, i.e. between the year-end price values we got before.
data.resample('Y').last().pct_change()
Output: Step 2
Now, in the final step, we calculate the mean percentage change over the entire time period using the mean() method:
data.resample('Y').last().pct_change().mean()
Output: Step 3
Like @itprorh66 already wrote, the main difference between the two approaches is just when the mean of the values is calculated: the first approach averages daily returns and scales them by 252 trading days, while the second averages the year-over-year returns.
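To make that concrete, here is a short sketch with dummy prices (the series and its values are hypothetical); the two numbers generally will not match:
import pandas as pd
import numpy as np
# hypothetical daily "Adj Close" series over five years
idx = pd.bdate_range("2015-01-01", "2019-12-31")
prices = pd.Series(100 * np.exp(np.random.normal(0.0005, 0.02, len(idx)).cumsum()), index=idx)
# approach 1: mean daily return scaled to ~252 trading days per year
annualized_from_daily = prices.pct_change().mean() * 252
# approach 2: mean of year-over-year returns
annualized_from_yearly = prices.resample('Y').last().pct_change().mean()
print(annualized_from_daily, annualized_from_yearly)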
I have a Pandas Series which contains acceleration timeseries data. My goal is to select slices of extreme force given some threshold. I was able to get part way with the following:
extremes = series.where(lambda force: abs(force - RESTING_FORCE) >= THRESHOLD, other=np.nan)
Now extremes contains all values which exceed the threshold and NaN for any that don't, maintaining the original index.
However, a secondary requirement is that nearby peaks should be merged into a single event. Visually, you can picture the three extremes on the left (two high, one low) being joined into one complete segment and the two peaks on the right being joined into another complete segment.
I've read through the entire Series reference but I'm having trouble finding methods to operate on my partial dataset. For example, if I had a method that returned an array of non-NaN index ranges, I would be able to sequentially compare each range and decide whether or not to fill in the space between with values from the original series (nearby) or leave them NaN (too far apart).
Perhaps I need to abandon the intermediate step and approach this from an entirely different angle? I'm new to Python so I'm having trouble getting very far with this. Any tips would be appreciated.
It actually wasn't so simple to come up with a vectorized solution without looping.
You'll probably need to go through the code step by step to see the actual outcome of each method, but here is a short sketch of the idea:
Solution outline
Identify all peaks via simple threshold filter
Get the timestamps of peak values into a column and forward fill the gaps in between, so the current valid timestamp can be compared with the previous one
Do actual comparison via diff() to get time deltas and apply time delta comparison
Convert booleans to integers and use a cumulative sum to create signal groups
Group by signals and get min and max timestamp values
Example data
Here is the code with a dummy example:
%matplotlib inline
import pandas as pd
import numpy as np
size = 200
# create some dummy data
ts = pd.date_range(start="2017-10-28", freq="d", periods=size)
values = np.cumsum(np.random.normal(size=size)) + np.sin(np.linspace(0, 100, size))
series = pd.Series(values, index=ts, name="force")
series.plot(figsize=(10, 5))
Solution code
# define thresholds
threshold_value = 6
threshold_time = pd.Timedelta(days=10)
# create data frame because we'll need helper columns
df = series.reset_index()
# get all initial peaks below or above threshold
mask = df["force"].abs().gt(threshold_value)
# create column to store only the timestamps of initial peaks
df.loc[mask, "ts_gap"] = df.loc[mask, "index"]
# create forward fill to enable comparison between current and next peak
df["ts_fill"] = df["ts_gap"].ffill()
# apply time delta comparison to filter only those within given time interval
df["within"] = df["ts_fill"].diff() < threshold_time
# convert boolean values into integers and
# create cumulative sum which creates groups of consecutive timestamps
df["signals"] = (~df["within"]).astype(int).cumsum()
# create dataframe containing start and end values
df_signal = df.dropna(subset=["ts_gap"])\
.groupby("signals")["ts_gap"]\
.agg(["min", "max"])
# show results
df_signal
>>> min max
signals
10 2017-11-06 2017-11-27
11 2017-12-13 2018-01-22
12 2018-02-03 2018-02-23
Finally, show the plot:
series.plot(figsize=(10, 5))
for _, (idx_min, idx_max) in df_signal.iterrows():
series[idx_min:idx_max].plot()
Result
As you can see in the plot, peaks greater than an absolute value of 6 are merged into a single signal if their last and first timestamps are within a range of 10 days. The thresholds here are arbitrary, just for illustration purposes; you can change them to whatever suits your data.
I need to confirm a few things related to the pandas exponential weighted moving average function.
If I have a data set df for which I need to find a 12-day exponential moving average, would the method below be correct?
exp_12 = df.ewm(span=20, min_periods=12, adjust=False).mean()
Given the data set contains 20 readings, I set span (the total number of values) equal to 20.
Since I need to find a 12-day moving average, I set min_periods=12.
I interpret span as the total number of values in the data set, or the total time covered.
Can someone confirm if my interpretation above is correct?
I also can't grasp the significance of adjust.
I've attached the link to pandas.df.ewm documentation below.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html
Quoting from Pandas docs:
Span corresponds to what is commonly called an “N-day EW moving average”.
In your case, set span=12.
You do not need to specify that you have 20 datapoints; pandas takes care of that. min_periods may not be required here.
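For completeness, a minimal sketch of a 12-day EMA under that reading (the price values here are hypothetical dummies standing in for df):
import pandas as pd
# hypothetical price series standing in for df
prices = pd.Series([100, 101, 99, 102, 103, 105, 104, 106, 108, 107,
                    109, 110, 112, 111, 113, 115, 114, 116, 118, 117], dtype=float)
# span=12 gives the 12-day EW moving average; adjust=False uses the
# recursive form y_t = (1 - alpha) * y_{t-1} + alpha * x_t with alpha = 2 / (span + 1)
exp_12 = prices.ewm(span=12, adjust=False).mean()
print(exp_12.tail())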