Problem with SARIMAX plotting in matplotlib [duplicate] - python

I'm using matplotlib to display a stock's price movements over time. I want to focus on the last 90 days and then predict the next 14 days. I have the last 90 days of data and my predictions, but I want to graph my predictions in a different color, so it's clear they're different.
How would I do this?
If I just add a second plot() call to my code, the predictions will start from the same point as my 90 days of data and be overlaid, which isn't what I want.
Right now I'm doing this:
df[-90:]["price"].plot()
plt.show()
Thanks!

Hopefully this is what you want:
import pandas as pd
import numpy as np; np.random.seed(1)
import matplotlib.pyplot as plt
datelist = pd.date_range(pd.datetime(2018, 1, 1), periods=104)
df = pd.DataFrame(np.cumsum(np.random.randn(104)),
columns=['price'], index=datelist)
plt.plot(df[:90].index, df[:90].values)
plt.plot(df[90:].index, df[90:].values)
# If you don't like the break in the graph, change 90 to 89 in the above line
plt.gcf().autofmt_xdate()
plt.show()

import matplotlib.pyplot as plt
import numpy as np
last90days = np.random.rand(90)
next14days = np.random.rand(14)
plt.plot(np.arange(90), last90days)
plt.plot(np.arange(90, 90+14), next14days)
plt.show()

Short answer:
Use pd.merge() and make good use of missing values in two different series to get two lines with different colors. This suggestion will be very flexible with regards to what type of dataframe index you're using (dates, integers og strings). This is what you'll get:
Long answer:
About the detail regarding...
I want to focus on the last 90 days and then predict the next 14 days.
... I'm going to assume that you are using a dataframe with a daily index. I'm also assuming that you know the index values of your dataset with 90 days and your dataset with 14 days.
Here's a dataframe with 104 observations (random data):
Snippet 1:
import pandas as pd
import numpy as np
np.random.seed(12)
rows = 104
df = pd.DataFrame(np.random.randint(-4,5,size=(rows, 1)), columns=['data'])
datelist = pd.date_range(pd.datetime(2018, 1, 1).strftime('%Y-%m-%d'), periods=rows).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
df = df.cumsum()
df.plot()
Plot 1:
To replicate your setup, I've split the dataframe into two different frames with 90 observations (price) and 14 days (predictions). This way, you'll have two different datasets, but the associated index will be contiuous - which I assume is your actual situation.
Snippet 2:
df_90 = df[:90].copy(deep = True)
df_14 = df[-14:].copy(deep = True)
df_90.columns = ['price']
df_14.columns = ['predictions']
df_90.plot()
df_14.plot()
Plot 2:
Now you can merge them together so that you'll get a dataframe with two columns (data and predictions). Of course you'll end up with some missing data, but that is exactly what is going to give you two lines with different colors when you plot it.
Snippet 3:
df_all = pd.merge(df_90, df_14, how = 'outer', left_index=True, right_index=True)
df_all.plot()
Plot 3:
I hope the suggested solution matches your real situation. Let me know if the details about the index will be an issue, and I'll take a look at that as well.
Here's the complete code for an easy copy-paste:
import pandas as pd
import numpy as np
np.random.seed(12)
rows = 104
df = pd.DataFrame(np.random.randint(-4,5,size=(rows, 1)), columns=['data'])
datelist = pd.date_range(pd.datetime(2018, 1, 1).strftime('%Y-%m-%d'), periods=rows).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
df = df.cumsum()
df.plot()
df_90 = df[:90].copy(deep = True)
df_14 = df[-14:].copy(deep = True)
df_90.columns = ['price']
df_14.columns = ['predictions']
df_90.plot()
df_14.plot()
df_all = pd.merge(df_90, df_14, how = 'outer', left_index=True, right_index=True)
df_all.plot()

Related

Plotly Distplot subplots

I am trying to write a for loop that for distplot subplots.
I have a dataframe with many columns of different lengths. (not including the NaN values)
fig = make_subplots(
rows=len(assets), cols=1,
y_title = 'Hourly Price Distribution')
i=1
for col in df_all.columns:
fig = ff.create_distplot([[df_all[[col]].dropna()]], col)
fig.append()
i+=1
fig.show()
I am trying to run a for loop for subplots for distplots and get the following error:
PlotlyError: Oops! Your data lists or ndarrays should be the same length.
UPDATE:
This is an example below:
df = pd.DataFrame({'2012': np.random.randn(20),
'2013': np.random.randn(20)+1})
df['2012'].iloc[0] = np.nan
fig = ff.create_distplot([df[c].dropna() for c in df.columns],
df.columns,show_hist=False,show_rug=False)
fig.show()
I would like to plot each distribution in a different subplot.
Thank you.
Update: Distribution plots
Calculating the correct values is probably both quicker and more elegant using numpy. But I often build parts of my graphs using one plotly approach(figure factory, plotly express) and then use them with other elements of the plotly library (plotly.graph_objects) to get what I want. The complete snippet below shows you how to do just that in order to build a go based subplot with elements from ff.create_distplot. I'd be happy to give further explanations if the following suggestion suits your needs.
Plot
Complete code
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objects as go
df = pd.DataFrame({'2012': np.random.randn(20),
'2013': np.random.randn(20)+1})
df['2012'].iloc[0] = np.nan
df = df.reset_index()
dfm = pd.melt(df, id_vars=['index'], value_vars=df.columns[1:])
dfm = dfm.dropna()
dfm.rename(columns={'variable':'year'}, inplace = True)
cols = dfm.year.unique()
nrows = len(cols)
fig = make_subplots(rows=nrows, cols=1)
for r, col in enumerate(cols, 1):
dfs = dfm[dfm['year']==col]
fx1 = ff.create_distplot([dfs['value'].values], ['distplot'],curve_type='kde')
fig.add_trace(go.Scatter(
x= fx1.data[1]['x'],
y =fx1.data[1]['y'],
), row = r, col = 1)
fig.show()
First suggestion
You should:
1. Restructure your data with pd.melt(df, id_vars=['index'], value_vars=df.columns[1:]),
2. and the use the occuring column 'variable' to build subplots for each year through the facet_row argument to get this:
In the complete snippet below you'll see that I've changed 'variable' to 'year' in order to make the plot more intuitive. There's one particularly convenient side-effect with this approach, namely that running dfm.dropna() will remove the na value for 2012 only. If you were to do the same thing on your original dataframe, the corresponding value in the same row for 2013 would also be removed.
import numpy as np
import pandas as pd
import plotly.express as px
df = pd.DataFrame({'2012': np.random.randn(20),
'2013': np.random.randn(20)+1})
df['2012'].iloc[0] = np.nan
df = df.reset_index()
dfm = pd.melt(df, id_vars=['index'], value_vars=df.columns[1:])
dfm = dfm.dropna()
dfm.rename(columns={'variable':'year'}, inplace = True)
fig = px.histogram(dfm, x="value",
facet_row = 'year')
fig.show()

how to make line charts by iterating pandas columns in python?

I have weekly time-series data that I want to make a weekly line chart using matplotlib/seaborn. To do so, I did aggregate given time series data correctly and tried to make plots, but the output was not correct to me. Essentially, in my data, columns are the list of countries, and the index is the weekly time index. What I wanted to do is, first iterate pandas columns by each country then group it by year and week, so I could have a weekly linechart for each countries. The way of aggregating my data is bit inefficient, which I assume gave me the problem. Can anyone suggest me possible way of doing this? Any way to get line chart by iterating pandas columns where grouping its time index? Any idea?
my attempt and data
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://gist.githubusercontent.com/adamFlyn/7c96d7f7c05f16abcc39befcd74f5ca8/raw/8997332cd3cdec7610aeaa0300a1b85f9daafccb/prod_sales.csv'
df = pd.read_csv(url, parse_dates=['date'])
df.drop(columns=['Unnamed: 0'], inplace=True)
df1_bf.index = pd.to_datetime(df1_bf.index, errors="coerce")
df1_bf.index.name = 'date'
df1_bf.reset_index('date')
df1_bf['year'] = pd.DatetimeIndex(df1_bf.index).year
df1_bf['week'] = pd.DatetimeIndex(df1_bf.index).week
for i in df1_bf.columns:
df_grp = df1.groupby(['year', 'week'])[i].sum().unstack()
fig,ax1 = plt.subplots(nrows=1,ncols=1,squeeze=True,figsize=(16,10))
for j in df_grp['year']:
ax1.plot(df_grp.week, j, next(linecycler),linewidth=3)
plt.gcf().autofmt_xdate()
plt.style.use('ggplot')
plt.xticks(rotation=0)
plt.show()
plt.close()
but I couldn't get the correct plot by attempting the above. Seems I might wrong with data aggregation part for making plot data. Can anyone suggest me possible way of making this right? any thoughts?
desired output
This is the example plot that I want to make. I want to iterate pandas columns then group its timeindex, so I want to get line chart of weekly time series for each country in loop.
how should I get this desired plot? Is there any way of doing this right with matplotlib or seaborn? Any idea?
You need to melt your dataframe and then groupby. Then, use Seaborn to create a plot, passing the data, x, y and hue. Passing hue allows you to avoid looping and makes it a lot cleaner:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/7c96d7f7c05f16abcc39befcd74f5ca8/raw/8997332cd3cdec7610aeaa0300a1b85f9daafccb/prod_sales.csv'
df = pd.read_csv(url, parse_dates=['Unnamed: 0'])
df = df.rename({'Unnamed: 0' : 'date'}, axis=1)
df['year'] = df['date'].dt.year
df['week'] = df['date'].dt.week
df = df.melt(id_vars=['date','week','year'])
df = df.groupby(['year', 'week'], as_index=False)['value'].sum()
fig, ax = plt.subplots(squeeze=True,figsize=(16,10))
sns.lineplot(data=df, x='week', y='value', hue='year',linewidth=3)
plt.show()
This is the first and last 5 rows of df before plotting:
year week value
0 2018 1 2268.0
1 2019 1 11196.0
2 2019 2 0.0
3 2019 3 0.0
4 2019 4 0.0
.. ... ... ...
100 2020 49 17111.0
101 2020 50 18203.0
102 2020 51 12787.0
103 2020 52 26245.0
104 2020 53 11772.0
Per your comment, you are looking for relplot and pass kind='line'. There are all sorts of formatting parameters you can pass with relplot or you can search how to loop through the axes to make more changes:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/7c96d7f7c05f16abcc39befcd74f5ca8/raw/8997332cd3cdec7610aeaa0300a1b85f9daafccb/prod_sales.csv'
df = pd.read_csv(url, parse_dates=['Unnamed: 0'])
df = df.rename({'Unnamed: 0' : 'date'}, axis=1)
df['year'] = df['date'].dt.year
df['week'] = df['date'].dt.isocalendar().week
df = df.melt(id_vars=['date','week','year'], var_name='country')
df = df.loc[df['value'] < 3000].groupby(['country', 'year', 'week'], as_index=False)['value'].sum()
sns.relplot(data=df, x='week', y='value', hue='year', row='country', kind='line', facet_kws={'sharey': False, 'sharex': True})
df

How to plot multiple data one after another in the same graph using Python Pandas DataFrame

I was trying to visualize a facebook stock dataset, where the data for 2014 to 2018 is stored. The dataset looks like this: dataset screenshot
My goal is to visualize the closing column, but by year. That is, year 2014, then 2015 and so on, but they should be in one figure, and one after another. Something like this: expected graph image
But whatever I try, all the graph parts start from index 0, instead of continuing from the end of the previous one. Here's what I got: the graph I generated
Please help me to solve this problem. Thanks!
The most straightforward way is simply to create separate dataframes with empty
values for the non-needed dates.
Here I use an example dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame(
np.random.randint(0, 100, size=100),
index=pd.date_range(start="2020-01-01", periods=100, freq="D"),
)
Then you can create and select the data to plot
df1 = df.copy()
df2 = df.copy()
df1[df.index > pd.to_datetime('2020-02-01')] = np.NaN
df2[df.index < pd.to_datetime('2020-02-01')] = np.NaN
And then simply plot these on the same axis.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(18, 8))
ax.plot(df1)
ax.plot(df2)
The result

Pandas 2 dataframes into one graph

I have 2 separate dataframes that look exactly the same but with different numbers in it
df = pd.DataFrame({'clip emotes':[79,223,435,291,188,99,153,50,55,78,83,48,43,73]}, index=['roohappy','rooblank','lul','omegalul','pog','pogchamp','roovv','roowut','roopog','pepehands','biblethumb','roocry','rooree','rooblind'])
df
and
df = pd.DataFrame({'vod emotes':[3963,7286,5560,4390,3386,3111,2639,2612,2422,1999,1948,1691,1654,1573,1308,1090,1024,1019,1019,974,945,912,893,856,790,771,731,677,658,652]}, index=['rood','roovv','pepega','lul','clap','rookek','roocult','rooblank','pog','rooree','rooaww','roohappy','omegaroll','rooduck','rooh','rareroo','roocry','pepehand','lulw','rooderp','roopog','hyperclap','roospy','rooayaya','omegalul','roolove','roowut','roonya','monkas','roo4'])
df
and then I do df.plot(kind = 'bar') for both of the separately. I cant figure out how can I put these two datas into a one graph one over the other so that one bar with the same name would be over the other with a different colour.
You can do it by joining them:
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'clip emotes':[79,223,435,291,188,99,153,50,55,78,83,48,43,73]}, index=['roohappy','rooblank','lul','omegalul','pog','pogchamp','roovv','roowut','roopog','pepehands','biblethumb','roocry','rooree','rooblind'])
df2 = pd.DataFrame({'vod emotes':[3963,7286,5560,4390,3386,3111,2639,2612,2422,1999,1948,1691,1654,1573,1308,1090,1024,1019,1019,974,945,912,893,856,790,771,731,677,658,652]}, index=['rood','roovv','pepega','lul','clap','rookek','roocult','rooblank','pog','rooree','rooaww','roohappy','omegaroll','rooduck','rooh','rareroo','roocry','pepehand','lulw','rooderp','roopog','hyperclap','roospy','rooayaya','omegalul','roolove','roowut','roonya','monkas','roo4'])
df3 = df2.join(df1)
df3.plot(kind='bar', stacked=True)
plt.tight_layout()

Box plot for continuous data in Python

I have a csv file with 2 columns:
col1- Timestamp data(yyyy-mm-dd hh:mm:ss.ms (8 months data))
col2 : Heat data (continuous variable) .
Since there are almost 50k record, I would like to partition the col1(timestamp col) into months or weeks and then apply box plot on the heat data w.r.t timestamp.
I tried in R,it takes a long time. Need help to do in Python. I think I need to use seaborn.boxplot.
Please guide.
Group by Frequency then plot groups
First Read your csv data into a Pandas DataFrame
import numpy as np
import Pandas as pd
from matplotlib import pyplot as plt
# assumes NO header line in csv
df = pd.read_csv('\file\path', names=['time','temp'], parse_dates=[0])
I will use some fake data, 30 days of hourly samples.
heat = np.random.random(24*30) * 100
dates = pd.date_range('1/1/2011', periods=24*30, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
Set the timestamps as the DataFrame's index
df = df.set_index('time')
Now group by by the period you want, seven days for this example
gb = df.groupby(pd.Grouper(freq='7D'))
Now you can plot each group separately
for g, week in gb2:
#week.plot()
week.boxplot()
plt.title(f'Week Of {g.date()}')
plt.show()
plt.close()
And... I didn't realize you could do this but it is pretty cool
ax = gb.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=30)
plt.show()
plt.close()
heat = np.random.random(24*300) * 100
dates = pd.date_range('1/1/2011', periods=24*300, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
df = df.set_index('time')
To partition the data in five time periods then get weekly boxplots of each:
Determine the total timespan; divide by five; create a frequency alias; then groupby
dt = df.index[-1] - df.index[0]
dt = dt/5
alias = f'{dt.total_seconds()}S'
gb = df.groupby(pd.Grouper(freq=alias))
Each group is a DataFrame so iterate over the groups; create weekly groups from each and boxplot them.
for g,d_frame in gb:
gb_tmp = d_frame.groupby(pd.Grouper(freq='7D'))
ax = gb_tmp.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=90)
plt.show()
plt.close()
There might be a better way to do this, if so I'll post it or maybe someone will fill free to edit this. Looks like this could lead to the last group not having a full set of data. ...
If you know that your data is periodic you can just use slices to split it up.
n = len(df) // 5
for tmp_df in (df[i:i+n] for i in range(0, len(df), n)):
gb_tmp = tmp_df.groupby(pd.Grouper(freq='7D'))
ax = gb_tmp.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=90)
plt.show()
plt.close()
Frequency aliases
pandas.read_csv()
pandas.Grouper()

Categories

Resources