pandas: calculate the daily average, grouped by label - python

I want to create a graph with one line per label,
so in the example picture, each line represents a distinct label.
The data looks something like this, where the x-axis is the datetime and the y-axis is the count.
datetime, count, label
1656140642, 12, A
1656140643, 20, B
1656140645, 11, A
1656140676, 1, B
Because I have a lot of data, I want to aggregate it by 1 hour or even 1 day chunks.
I'm able to generate the above picture with
# df is the dataframe here, the result of pandas.read_csv
df.set_index("datetime").groupby("label")["count"].plot()
and I can get a time-range average with
df.set_index("datetime").groupby(pd.Grouper(freq='2min')).mean().plot()
but I'm unable to get both rules applied. Can someone point me in the right direction?

You can use the .pivot function (see the documentation) to create a convenient structure where datetime is the index and the different labels are the columns, with count as the values.
df.set_index('datetime').pivot(columns='label', values='count')
output:
label          A     B
datetime
1656140642  12.0   NaN
1656140643   NaN  20.0
1656140645  11.0   NaN
1656140676   NaN   1.0
Now that your data is in this format, you can perform a simple aggregation over the index (with groupby / resample / whatever suits you) and it will be applied to each column separately. Plotting the result is then just plotting a different line for each column.

Related

Plotting a cumulative sum with groupby in pandas

I'm missing something really obvious or simply doing this wrong. I have two dataframes of similar structure and I'm trying to plot a time-series of the cumulative sum of one column from both. The dataframes are indexed by date:
df1
             value
2020-01-01    2435
2020-01-02   12847
...
2020-10-01   34751
The plot should be grouped by month and be a cumulative sum of the whole time range. I've tried:
line1 = df1.groupby(pd.Grouper(freq='1M')).value.cumsum()
line2 = df2.groupby(pd.Grouper(freq='1M')).value.cumsum()
and then plot, but it resets after each month. How can I change this?
The reset happens because cumsum inside a groupby is computed separately within each monthly group. I am guessing you want to take the cumulative sum over the whole range first, then group and take the mean (or another representative value) for each month, and plot:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'value': np.random.randint(100, 200, 366)},
                   index=pd.date_range(start='1/1/2018', end='1/1/2019'))
df1.cumsum().groupby(pd.Grouper(freq='1M')).mean().plot()

How to get values about change points in facebook prophet?

I'm using the Facebook Prophet library, and now I have a problem.
When I use the add_changepoints_to_plot function, I can see the red line and the red dotted lines marking the change points, but I want to get these values.
How can I get the values of the change points or the slope?
I want the numerical values of the points in time at which the change points occur, and I need a way to decide from those values whether the trend goes up or down.
Welcome to SO. You need to provide a code snippet.
Prophet needs a dataframe which contains two columns (ds and y).
Column ds contains the dates, and column y contains the value for each date.
As far as I understand, your data has changepoints and you want to see the values on the changepoint dates.
I'll leave an example code snippet here, assuming you have a df with columns "ds" and "y":
from prophet import Prophet  # "from fbprophet import Prophet" in older versions

estimator = Prophet()
estimator.fit(df)
df.loc[df["ds"].isin(estimator.changepoints)]  # rows of df that fall on changepoint dates
estimator.changepoints contains the dates at which changepoints occur. If you filter your dataframe to these dates, you will get the changepoint values.
For example:
mdl = Prophet(yearly_seasonality=True, interval_width=0.95, n_changepoints = 5)
mdl.add_country_holidays(country_name='US')
mdl.fit(df)
mdl.changepoints
Output:
62 2021-07-06
125 2021-09-07
187 2021-11-08
250 2022-01-10
312 2022-03-13
Name: ds, dtype: datetime64[ns]

How to plot stacked time histogram starting from a Pandas DataFrame?

Consider the following DataFrame df:
Date                 Kind
2018-09-01 13:15:32  Red
2018-09-02 16:13:26  Blue
2018-09-04 22:10:09  Blue
2018-09-04 09:55:30  Red
...                  ...
In which you have a column with a datetime64[ns] dtype and another which contains an np.object that can assume only a finite number of values (in this case, 2).
You have to plot a date histogram in which you have:
On the x-axis, the dates (per-day histogram showing month and day);
On the y-axis, the number of items belonging to that date, showing in a stacked bar the difference between Blue and Red.
How is it possible to achieve this using Matplotlib?
I was thinking of doing a set_index and resample as follows:
df.set_index('Date', inplace=True)
df.resample('1d').count()
But I'm losing the information on the number of items per Kind. I also want to keep any missing day as zero.
Any help is very much appreciated.
Use groupby, count and unstack to adjust the dataframe:
df2 = df.groupby(['Date', 'Kind'])['Kind'].count().unstack('Kind').fillna(0)
Next, re-sample the dataframe and sum the count for each day. This will also add any missing days that are not in the dataframe (as specified). Then adjust the index to only keep the date part.
df2 = df2.resample('D').sum()
df2.index = df2.index.date
Now plot the dataframe with stacked=True:
df2.plot(kind='bar', stacked=True)
Alternatively, the plt.bar() function can be used for the final plotting:
import matplotlib.pyplot as plt

cols = df['Kind'].unique()  # find all original values in the column
ind = range(len(df2))       # one bar position per day
p1 = plt.bar(ind, df2[cols[0]])
p2 = plt.bar(ind, df2[cols[1]], bottom=df2[cols[0]])  # stack the second Kind on top of the first
Here it is necessary to set the bottom argument of each part to be the sum of all the parts that came before.

Pandas Rolling Correlation Introduces Gaps

I have a relatively clean data set with two columns and no gaps, a snapshot is shown below:
I run the following line of code:
correlation = pd.rolling_corr(data['A'], data['B'], window=120)
and for some reason, this outputs a dataframe (shown as a plot below) with large gaps in it:
I haven't personally seen this issue before, and after reviewing the data (more than the code) I am not sure what the issue could be.
This happens due to missing dates in the time series (weekends etc.); evidence of this in your example is the jump from 7/2/2003 to 10/2/2003. One solution is to fill in the gaps by re-indexing the time series dataframe.
df.index = pd.DatetimeIndex(df.index) # required
df = df.asfreq('D') # reindex will include missing days
df = df.fillna(method='bfill') # fill / interpolate NaNs
corr = df.A.rolling(30).corr(df.B) # no gaps
You are getting NAN values in your correlation variable where the number of rows is less than the value of the window attribute.
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)})
correlation = pd.rolling_corr(data['A'], data['B'], window=3)
print(correlation)
0 NaN
1 NaN
2 0.852602
3 0.020681
4 -0.915110
5 -0.741857
6 0.173987
7 0.874049
8 -0.874258
9 -0.835340
The docs for this function warn about this in the min_periods parameter section: "Minimum number of observations in window required to have a value (otherwise result is NA)."
With the default of None, min_periods falls back to the window size, which is why the first window - 1 rows come out as NaN; pass a smaller min_periods if you want values earlier.

multiple subplots on same figure with nan values

Hi, so I have a function that plots time series data for a given argument (in my case it's a country name). Now some of the columns have NaN values, and when I try to plot them I can't because of those NaN values. How can I solve this problem?
This is the code, which gets the dataframe, and the function I'm using:
import io
import requests
import pandas as pd

url2 = 'https://spreadsheets.google.com/pub?key=phAwcNAVuyj1jiMAkmq1iMg&output=xls'
source = io.BytesIO(requests.get(url2).content)
income = pd.read_excel(source)
income.head()
income.set_index("GDP per capita", inplace=True)
import matplotlib.pyplot as plt

def gdpchange(country):
    dfff = income.loc[country]
    dfff.T.plot(kind='line')
    plt.legend([country])
Now if I try to plot all of them on one graph, it gives an error because of the NaN values in some columns. Any suggestions?
for ctr in income.index.values:
    gdpchange(ctr)
You have to drop the NaN values with pandas' dropna():
income.dropna(inplace=True)
This statement drops all rows that have any NaN values in the income dataframe.
