I have been using the Facebook Prophet library and have run into a problem.
When I use the add_changepoints_to_plot function, I can see the red trend line and the red dashed lines for the changepoints, but I want to get these as values.
How can I get the values of the changepoints or the slope?
I want the numerical values (the dates) of the changepoints, and a way to decide from those values whether the trend goes up or down.
Welcome to SO. You need to provide a code snippet.
Prophet needs a data frame that contains two columns, ds and y: ds holds the dates and y holds the value for each date.
As far as I understand, your data has changepoints and you want to see the values on the changepoint dates.
I'll leave an example snippet here, assuming you have a df with columns "ds" and "y":
from prophet import Prophet  # on older versions the import is: from fbprophet import Prophet

estimator = Prophet()
estimator.fit(df)
# keep only the rows of df whose dates are changepoint dates
df.loc[df["ds"].isin(estimator.changepoints)]
estimator.changepoints contains the dates at which the changepoints occur. If you filter your dataframe on these dates, you will get the changepoint values.
For example:
mdl = Prophet(yearly_seasonality=True, interval_width=0.95, n_changepoints = 5)
mdl.add_country_holidays(country_name='US')
mdl.fit(df)
mdl.changepoints
Output:
62 2021-07-06
125 2021-09-07
187 2021-11-08
250 2022-01-10
312 2022-03-13
Name: ds, dtype: datetime64[ns]
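If you also want to know whether the trend goes up or down at each changepoint, the fitted model keeps the estimated rate adjustments. Below is a minimal sketch assuming a fitted model named mdl as above; it reads mdl.params['k'] (the base growth rate) and mdl.params['delta'] (the rate change at each changepoint), which Prophet stores after fitting:
import numpy as np
# rate change at each changepoint, averaged over the fitted parameter samples
deltas = np.mean(mdl.params['delta'], axis=0)
# slope of the trend after each changepoint = base rate + cumulative rate changes
slopes = np.mean(mdl.params['k']) + np.cumsum(deltas)
for date, slope in zip(mdl.changepoints, slopes):
    print(date.date(), 'up' if slope > 0 else 'down')
Note that these slopes are on Prophet's internally scaled series, so use the sign for the up/down direction rather than reading the magnitude in the original units.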
I want to create a graph with one line per label.
So, in the example picture, each line represents a distinct label.
The data looks something like this, where the x-axis is the datetime and the y-axis is the count.
datetime, count, label
1656140642, 12, A
1656140643, 20, B
1656140645, 11, A
1656140676, 1, B
Because I have a lot of data, I want to aggregate it by 1 hour or even 1 day chunks.
I'm able to generate the above picture with
# df is the dataframe here, the result of pandas.read_csv
df.set_index("datetime").groupby("label")["count"].plot()
and I can get a time-range average with
df.set_index("datetime").groupby(pd.Grouper(freq='2min')).mean().plot()
but I'm unable to get both rules applied. Can someone point me in the right direction?
You can use the .pivot function (see the documentation) to create a convenient structure where datetime is the index and the distinct labels are the columns, with count as the values.
df.set_index('datetime').pivot(columns='label', values='count')
output:
label A B
datetime
1656140642 12.0 NaN
1656140643 NaN 20.0
1656140645 11.0 NaN
1656140676 NaN 1.0
Now that you have your data in this format, you can perform a simple aggregation over the index (with groupby / resample / whatever suits you), and it will be applied to each column separately. Then plotting the result is just plotting a separate line per column, as in the sketch below.
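For example, here is a minimal sketch assuming the datetime column holds Unix timestamps in seconds (as in the sample data) and that hourly averages are wanted:
import pandas as pd
# df has columns: datetime (unix seconds), count, label
df["datetime"] = pd.to_datetime(df["datetime"], unit="s")
# wide format: datetime as the index, one column per label
wide = df.set_index("datetime").pivot(columns="label", values="count")
# hourly average per label, plotted as one line per column
wide.resample("1H").mean().plot()
Switching to daily chunks is just a matter of changing the resample frequency (e.g. "1D").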
I am currently working with PyTorch Forecasting and I want to create a dataset with TimeSeriesDataSet. My original data lives in a pandas DataFrame and looks like this:
date amount location
2014-01-01 5 A
2014-01-01 7 B
... ... ...
2017-12-30 4 H
2017-12-31 8 I
So in total I have nine unique values in "location" and an amount for each location per date. Now I am wondering what the group_ids parameter of the TimeSeriesDataSet class does and what its exact behaviour is? I am not really getting the idea from the documentation.
Thanks a lot in advance!
A time-series dataset usually contains multiple time series for different entities/individuals.
group_ids is a list of columns that together uniquely identify the entity behind each time series. In your example it would be location:
group_ids (List[str]) – list of column names identifying a time series. This means that the group_ids identify a sample together with the time_idx. If you have only one timeseries, set this to the name of column that is constant.
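As an illustration only (the encoder/prediction lengths below are arbitrary placeholders, and the time_idx derivation assumes daily data), the dataset could be built roughly like this:
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet

df["date"] = pd.to_datetime(df["date"])
# TimeSeriesDataSet expects an integer time index, so derive one from the date
df["time_idx"] = (df["date"] - df["date"].min()).dt.days

dataset = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="amount",
    group_ids=["location"],   # one separate series per location
    max_encoder_length=30,    # placeholder lengths
    max_prediction_length=7,
)
With group_ids=["location"], rows sharing the same location are treated as one series, so you get nine separate series rather than one mixed one.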
I have numbers stored in 2 data frames (real ones are much bigger) which look like
df1
A B C T Z
13/03/2017 1.321674 3.1790 3.774602 30.898 13.22
06/02/2017 1.306358 3.1387 3.712554 30.847 13.36
09/01/2017 1.361103 3.2280 3.738500 32.062 13.75
05/12/2016 1.339258 3.4560 3.548593 31.978 13.81
07/11/2016 1.295137 3.2323 3.188161 31.463 13.43
df2
A B C T Z
13/03/2017 1.320829 3.1530 3.7418 30.933 13.1450
06/02/2017 1.305483 3.1160 3.6839 30.870 13.2985
09/01/2017 1.359989 3.1969 3.7129 32.098 13.6700
05/12/2016 1.338151 3.4215 3.5231 32.035 13.7243
07/11/2016 1.293996 3.2020 3.1681 31.480 13.3587
and a list where I have stored all the daily dates from 13/03/2017 back to 07/11/2016.
I would like to create a dataframe with the following features:
the list of daily dates is the row index
I would like to create columns (in this case from A to Z) and, for each row/day, compute the linear interpolation between the value in df1 and the corresponding value in df2 shifted by -1. For example, in the row for 12/03/2017, column A, I want to compute [(34/35)*1.321674] + [(1/35)*1.305483] = 1.3212114, where 35 is the number of days between 13/03/2017 and 06/02/2017, 1.321674 is the df1 value in column A for 13/03/2017, and 1.305483 is the df2 value in column A for 06/02/2017. For 11/03/2017, column A, I want [(33/35)*1.321674] + [(2/35)*1.305483] = 1.3207488. The values 1.321674 and 1.305483 stay fixed for the whole interval down to 06/02/2017, where the result should be exactly 1.305483.
Finally, the interpolation should switch to the next pair of values once the row date enters the next time interval. For example, once I reach 05/02/2017, the interpolation should be between 1.306358 (df1, column A) and 1.359989 (df2, column A), that is, shifted one position down.
For clarity, date format is 'dd/mm/yyyy'
I would greatly appreciate any piece of advice or suggestion, I am aware it's a lot of work so any hint is valued!
Please let me know if you need more clarification.
Thanks!
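Not a definitive solution, just a sketch of the interpolation described above, assuming df1 and df2 share the same dates (dd/mm/yyyy strings in the index) and the same columns:
import pandas as pd

for d in (df1, df2):
    d.index = pd.to_datetime(d.index, dayfirst=True)
df1, df2 = df1.sort_index(), df2.sort_index()

rows = {}
dates = df1.index
for d_low, d_high in zip(dates[:-1], dates[1:]):
    n = (d_high - d_low).days                    # e.g. 35 days between 06/02/2017 and 13/03/2017
    for t in pd.date_range(d_low, d_high)[:-1]:  # every day in [d_low, d_high)
        w = (t - d_low).days / n
        rows[t] = w * df1.loc[d_high] + (1 - w) * df2.loc[d_low]
rows[dates[-1]] = df1.loc[dates[-1]]             # the most recent day keeps the df1 value

result = pd.DataFrame(rows).T.sort_index(ascending=False)
On the sample data this gives, for 12/03/2017 in column A, (34/35)*1.321674 + (1/35)*1.305483 ≈ 1.32121, matching the example above.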
I have a Pandas data frame with columns that are 'dynamic' (meaning that I don't know what the column names will be until I retrieve the data from the various databases).
The data frame is a single row and looks something like this:
Make Date Red Blue Green Black Yellow Pink Silver
89 BMW 2016-10-28 300.0 240.0 2.0 500.0 1.0 1.0 750.0
Note that '89' is that particular row in the data frame.
I have the following code:
cars_bar_plot = df_cars.loc[(df_cars.Make == 'BMW') & (df_cars.Date == as_of_date)]
cars_bar_plot = cars_bar_plot.replace(0, value=np.nan)
cars_bar_plot = cars_bar_plot.dropna(axis=1, how='all')
This works fine for creating the single-row data frame above, but some of the column values are very small (e.g. 1.0 and 2.0) relative to the others, and they distort the horizontal bar chart I'm creating with Matplotlib. I'd like to drop numbers that are smaller than some minimum threshold (e.g. 3.0).
Any idea how I can do that?
Thanks!
UPDATE 1
The following line of code helps, but does not fully solve the problem.
cars_bar_plot = cars_bar_plot.loc[:, (cars_bar_plot >= 3.0).any(axis=0)]
The problem is that it's eliminating unintended columns. For example, referencing the original data frame, is it possible to modify this code such that it only removes columns with a value less than 3.0 to the right of the "Black" column (under the assumption that we actually want to retain the value of 2.0 in the "Green" column)?
Thanks!
Assuming you want to keep only the rows matching your criteria, you can filter your data like this:
df[df.apply(lambda x: x > 0.5).min(axis=1)]
i.e. check every value against your condition and drop the row if at least one value doesn't satisfy it.
Here is the answer to my question:
lower_threshold = 3.0
start_column = 5  # position of the first column the threshold applies to ("Black")
# keep every column before start_column, and later columns only if they hold a value >= the threshold
keep = [c for i, c in enumerate(df.columns)
        if i < start_column or (df[c] >= lower_threshold).any()]
df = df[keep]
I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the first date for which data is available in the shorter series, and remove the rows in both columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guidelines for submitting questions.)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21 (I'll add that there may be NaNs in one of the columns for more recent dates).
You can reverse the series with df['osr'][::-1], use idxmax on isnull() to find the last date that is missing a value, and then take the subset of df after that date:
print(df)
#               osr      go
# Date
# 1990-08-17     NaN  239.75
# 1990-08-20     NaN  251.50
# 1990-08-21  352.00  265.00
# 1990-08-22  353.25  274.25
# 1990-08-23  351.75  290.25

s = df['osr'][::-1]
print(s)
# Date
# 1990-08-23    351.75
# 1990-08-22    353.25
# 1990-08-21    352.00
# 1990-08-20       NaN
# 1990-08-17       NaN
# Name: osr, dtype: float64

maxnull = s.isnull().idxmax()
print(maxnull)
# 1990-08-20 00:00:00

print(df[df.index > maxnull])
#               osr      go
# Date
# 1990-08-21  352.00  265.00
# 1990-08-22  353.25  274.25
# 1990-08-23  351.75  290.25
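If you prefer to skip the reversal, a shorter alternative (just a sketch, not the approach above) is to slice from the first non-NaN date of the shorter column using first_valid_index:
# keep everything from the first date where 'osr' has data onwards
df.loc[df['osr'].first_valid_index():]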
EDIT: New answer based upon comments/edits
It sounds like the data is sequential, and once you hit rows that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, the rest stay good, or that you don't mind dropping rows in the middle; it depends on how sequential you need the data to be. If the data needs to stay sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much structure for your dataframe here, so I am going to make some assumptions. I'm going to assume you have many columns, two of which, time_series_1 and time_series_2, are the ones you referred to in your question, and that this is all stored in df.
First we can find the shorter series, i.e. the one with fewer non-NaN values:
shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']
Now we want the first date in that series for which data is available:
start_date = shorter_col.first_valid_index()
Now we want to remove the data before that date:
mask = df.index >= start_date
df = df[mask]