I need to confirm a few things about the pandas exponentially weighted moving average function.
If I have a data set df for which I need to find a 12-day exponential moving average, would the method below be correct?
exp_12=df.ewm(span=20,min_period=12,adjust=False).mean()
Given that the data set contains 20 readings, the span (total number of values) should equal 20.
Since I need to find a 12-day moving average, min_period=12.
I interpret span as the total number of values in the data set, or the total time covered.
Can someone confirm whether my interpretation above is correct?
I also can't work out the significance of adjust.
I've attached the link to the pandas DataFrame.ewm documentation below.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html
Quoting from Pandas docs:
Span corresponds to what is commonly called an “N-day EW moving average”.
In your case, set span=12.
You do not need to specify that you have 20 data points; pandas takes care of that. min_periods may not be required here.
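As a minimal sketch of that (assuming the 20 readings sit in a single column, hypothetically named 'price'):

import pandas as pd

# 20 example readings; 'price' is a placeholder column name
df = pd.DataFrame({'price': [float(x) for x in range(1, 21)]})

# 12-day exponentially weighted moving average: span=12, not the number of rows
exp_12 = df['price'].ewm(span=12, adjust=False).mean()

Regarding adjust: with adjust=False pandas uses the recursive form y_t = (1 - alpha)*y_{t-1} + alpha*x_t, while the default adjust=True divides by the decaying sum of the weights, which mainly affects the first few values of the series.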
I have a DataFrame tracking temperatures indexed by time.
For a few days there was a problem and the temperature was recorded as 0, so the plot shows sudden drops to zero over those days.
I have replaced the 0 values with NaNs and then used the interpolate method, but the result is not what I need, even when I use method='time'.
So how can I use a customised interpolation or something to correct this based on previous behaviour?
Thank you
I would not interpolate. I would just take N elements before and after the gap, compute the average temperature, and fill the gap with random values drawn from a normal distribution around that average (you can use the standard deviation of those elements too).
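A rough sketch of that idea (the 'Temp' column name, the window size N, and a single contiguous gap are assumptions):

import numpy as np
import pandas as pd

N = 48                                   # readings to look at on each side of the gap
gap = df['Temp'].isna()                  # assumes the zeros were already replaced by NaN
idx = np.flatnonzero(gap)

# neighbouring readings just before and just after the gap
before = df['Temp'].iloc[max(idx[0] - N, 0):idx[0]]
after = df['Temp'].iloc[idx[-1] + 1:idx[-1] + 1 + N]
neighbours = pd.concat([before, after])

# fill the gap with random draws around the local mean, using the local std
df.loc[gap, 'Temp'] = np.random.normal(neighbours.mean(), neighbours.std(), size=len(idx))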
I have searched for a while, but nothing related to my question came up, so I am posting a new thread.
I have a simple dataset which is read in by pandas as a DataFrame, with some daily data starting on 1951-08-01 and ending on 2018-10-01.
Now I want to down-sample the data to decadal means, so I simply do df.resample('10A').mean()['some data'].
This gives me 8 data points, at 1951-12, 1961-12, 1971-12, 1981-12, 1991-12, 2001-12, 2011-12, 2021-12. This indicates that the decadal mean values are calculated for the year 1951 separately, then for 1952-1961, 1962-1971, etc.
I wonder if it is possible to calculate the decadal mean values over 'structured' 10-year blocks?
For example, the decadal means calculated over 1950-1959, 1960-1969, 1970-1979, etc.
Any help is appreciated!
You can calculate the decade separately and group on that:
decade = df['Date'].dt.year.floordiv(10).mul(10)
df.groupby(decade)['Value'].mean()
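If the dates are in the index rather than in a 'Date' column (the 'Value' column name is just taken from the answer above), the same idea works on the index:

decade = df.index.year // 10 * 10
df.groupby(decade)['Value'].mean()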
I have the following data in a pandas dataframe:
FileName Onsets Offsets
FileName1 [0, 270.78, 763.33] [188.56, 727.28, 1252.90]
FileName2 [0, 634.34, 1166.57, 1775.95, 2104.01] [472.04, 1034.37, 1575.88, 1970.79, 2457.09]
FileName3 [0, 560.97, 1332.21, 1532.47] [356.79, 1286.26, 1488.54, 2018.61]
These are data from audio files. Each row contains a series of onset and offset times for each of the sounds I'm researching. This means that the numbers are coupled, e.g. the second offset time marks the end of the sound that began at the second onset time.
To test a hypothesis, I need to select random offset times within various ranges. For instance, I need to multiply each offset time by a factor between 0.95 and 1.05 to create random adjustments within a +/- 5% range around the actual offset time. Then 0.90 to 1.10, and so forth.
Importantly, the adjustment must not push the offset time earlier than the preceding onset time or later than the subsequent onset time. I think this means that I need to first calculate the largest acceptable adjustment for each offset time, and then set the maximum allowable adjustment for the whole dataset to whatever the lowest of those is. I'll be using this code for different datasets, so this maximum adjustment percentage shouldn't be hardcoded.
How can I code this function?
The code below generates adjustments, but I haven't figured out how to calculate and set the bounds yet.
import random

# each call draws one random factor and applies it to every offset
Offsets_5 = Offsets*(random.uniform(0.95,1.05))    # +/- 5%
Offsets_10 = Offsets*(random.uniform(0.90,1.10))   # +/- 10%
Offsets_15 = Offsets*(random.uniform(0.85,1.15))   # +/- 15%
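One way to derive the bound is sketched below. It assumes each row's onsets and offsets are equal-length, sorted sequences as in the table above, and that each offset may not cross its own onset or the next onset; the function and variable names are mine:

import numpy as np

def max_safe_pct(onsets, offsets):
    # largest +/- fraction by which every offset in one file can be scaled
    # without crossing its own onset (shrinking) or the next onset (growing)
    onsets = np.asarray(onsets, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    shrink = 1.0 - onsets / offsets              # room to move earlier
    grow = onsets[1:] / offsets[:-1] - 1.0       # room to move later
    return min(shrink.min(), grow.min()) if grow.size else shrink.min()

# dataset-wide bound, then one random factor per offset within that bound
max_pct = min(max_safe_pct(on, off) for on, off in zip(df['Onsets'], df['Offsets']))
adjusted = [np.asarray(off) * np.random.uniform(1 - max_pct, 1 + max_pct, len(off))
            for off in df['Offsets']]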
I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10-year period. I am using netCDF data and xarray to try to do this. The data consists of rainfall (recorded every 3 hours), lat, and lon. Right now I have the following code:
ds.groupby('time.season').max('time')
However, when I do it this way the output has a shape of (4, 145, 192), indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season in every year. In other words, the output should have a shape like (40, 145, 192) (4 values for each year x 10 years).
I've looked into doing this with Dataset.resample as well, using time='3M' as the frequency, but then it doesn't split the months up correctly. If I have to, I can alter the dataset so it starts in the correct place, but I was hoping there would be an easier way considering there's already a function to group it correctly.
Thanks, and let me know if you need any more details!
Resample is going to be the easiest tool for this job. You are close with the time frequency, but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
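As a quick check (the rainfall variable name is an assumption), the resampled time axis should now have one entry per quarter per year, with quarters starting in December, March, June and September, which matches the DJF/MAM/JJA/SON split:

seasonal_max = ds.resample(time='QS-Mar').max('time')
print(seasonal_max.time.values[:4])      # quarter start dates
print(seasonal_max['rainfall'].shape)    # roughly (40, 145, 192) for 10 years of data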
I will be shocked if there isn't some standard library function for this, especially in NumPy or SciPy, but no amount of Googling is providing a decent answer.
I am getting data from the Poloniex exchange - cryptocurrency. Think of it like getting stock prices - buy and sell orders - pushed to your computer. So what I have is a time series of prices for any given market. One market might get an update 10 times a day while another gets updated 10 times a minute - it all depends on how many people are buying and selling on the market.
So my timeseries data will end up being something like:
[1 0.0003234,
1.01 0.0003233,
10.0004 0.00033,
124.23 0.0003334,
...]
The 1st column is the time value (I use Unix timestamps to the microsecond, but didn't think that was necessary in the example). The 2nd column is one of the prices - either the buy or the sell price.
What I want is to convert it into a matrix where the data is "sampled" at a regular time frame. So the interpolated (zero-order hold) matrix would be:
[1 0.0003234,
2 0.0003233,
3 0.0003233,
...
10 0.0003233,
11 0.00033,
12 0.00033,
13 0.00033,
...
120 0.00033,
125 0.0003334,
...]
I want to do this with any reasonable time step. Right now I use np.linspace(start_time, end_time, time_step) to create the new time vector.
Writing my own, admittedly crude, zero-order hold interpolator won't be that hard. I'll loop through the original time vector and use np.nonzero to find all the indices in the new time vector that fall between one timestamp (t0) and the next (t1), then fill in those indices with the value from time t0.
For now, the crude method will work. The matrix of prices isn't that big. But I have to think there's a faster method using one of the built-in libraries. I just can't find it.
Also, for the example above I only use a matrix of Nx2 (column 1: times, column 2: price), but ultimately the market has 6 or 8 different parameters that might get updated. A method/library function that could handle multiple prices and such in different columns would be great.
Python 3.5 via Anaconda on Windows 7 (hopefully won't matter).
TIA
For your problem you can use scipy.interpolate.interp1d. It seems to be able to do everything that you want. It can do a zero-order hold interpolation if you specify kind="zero". It can also simultaneously interpolate multiple columns of a matrix; you just have to specify the appropriate axis. f = interp1d(xData, yDataColumns, kind='zero', axis=0) will then return a function that you can evaluate at any point in the interpolation range. You can then get your regularly sampled data by calling f(np.linspace(start_time, end_time, time_step)).
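Putting that together as a small, self-contained sketch (the sample values are the ones from the question; the one-second grid is an assumption):

import numpy as np
from scipy.interpolate import interp1d

data = np.array([[1.0,      0.0003234],
                 [1.01,     0.0003233],
                 [10.0004,  0.00033],
                 [124.23,   0.0003334]])
times, prices = data[:, 0], data[:, 1:]

# zero-order hold: each new sample takes the value of the last known price
f = interp1d(times, prices, kind='zero', axis=0)

# regular 1-second grid, kept inside the original time range
new_times = np.arange(times[0], times[-1], 1.0)
resampled = np.column_stack([new_times, f(new_times)])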