how to make line charts by iterating pandas columns in python?

I have weekly time-series data and want to draw a weekly line chart from it with matplotlib/seaborn. I aggregated the series and tried to plot it, but the output does not look right. In my data the columns are countries and the index is a weekly time index. What I want to do is iterate over the columns country by country, group each one by year and week, and get a weekly line chart for each country. My aggregation is a bit inefficient, and I suspect that is where the problem comes from. Can anyone suggest a way to get line charts by iterating over pandas columns while grouping on the time index?
my attempt and data
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://gist.githubusercontent.com/adamFlyn/7c96d7f7c05f16abcc39befcd74f5ca8/raw/8997332cd3cdec7610aeaa0300a1b85f9daafccb/prod_sales.csv'
df = pd.read_csv(url, parse_dates=['date'])
df.drop(columns=['Unnamed: 0'], inplace=True)
df1_bf.index = pd.to_datetime(df1_bf.index, errors="coerce")
df1_bf.index.name = 'date'
df1_bf.reset_index('date')
df1_bf['year'] = pd.DatetimeIndex(df1_bf.index).year
df1_bf['week'] = pd.DatetimeIndex(df1_bf.index).week
for i in df1_bf.columns:
    df_grp = df1.groupby(['year', 'week'])[i].sum().unstack()
    fig,ax1 = plt.subplots(nrows=1,ncols=1,squeeze=True,figsize=(16,10))
    for j in df_grp['year']:
        ax1.plot(df_grp.week, j, next(linecycler),linewidth=3)
    plt.gcf().autofmt_xdate()
    plt.style.use('ggplot')
    plt.xticks(rotation=0)
    plt.show()
    plt.close()
But I couldn't get the correct plot with the attempt above. I think the data aggregation step that prepares the plot data is wrong. Can anyone suggest how to fix it?
desired output
This is an example of the plot I want to make. I want to iterate over the pandas columns, group each by its time index, and draw a weekly time-series line chart for each country in a loop.
How can I get this plot? Is there a way to do this properly with matplotlib or seaborn?

You need to melt your dataframe and then groupby. Then, use Seaborn to create a plot, passing the data, x, y and hue. Passing hue allows you to avoid looping and makes it a lot cleaner:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/7c96d7f7c05f16abcc39befcd74f5ca8/raw/8997332cd3cdec7610aeaa0300a1b85f9daafccb/prod_sales.csv'
df = pd.read_csv(url, parse_dates=['Unnamed: 0'])
df = df.rename({'Unnamed: 0' : 'date'}, axis=1)
df['year'] = df['date'].dt.year
df['week'] = df['date'].dt.isocalendar().week
df = df.melt(id_vars=['date','week','year'])
df = df.groupby(['year', 'week'], as_index=False)['value'].sum()
fig, ax = plt.subplots(squeeze=True,figsize=(16,10))
sns.lineplot(data=df, x='week', y='value', hue='year',linewidth=3)
plt.show()
This is the first and last 5 rows of df before plotting:
year week value
0 2018 1 2268.0
1 2019 1 11196.0
2 2019 2 0.0
3 2019 3 0.0
4 2019 4 0.0
.. ... ... ...
100 2020 49 17111.0
101 2020 50 18203.0
102 2020 51 12787.0
103 2020 52 26245.0
104 2020 53 11772.0
Per your comment, you are looking for relplot with kind='line'. relplot accepts many formatting parameters, and you can also loop through the axes of the returned grid to make further changes:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/7c96d7f7c05f16abcc39befcd74f5ca8/raw/8997332cd3cdec7610aeaa0300a1b85f9daafccb/prod_sales.csv'
df = pd.read_csv(url, parse_dates=['Unnamed: 0'])
df = df.rename({'Unnamed: 0' : 'date'}, axis=1)
df['year'] = df['date'].dt.year
df['week'] = df['date'].dt.isocalendar().week
df = df.melt(id_vars=['date','week','year'], var_name='country')
df = df.loc[df['value'] < 3000].groupby(['country', 'year', 'week'], as_index=False)['value'].sum()
sns.relplot(data=df, x='week', y='value', hue='year', row='country', kind='line', facet_kws={'sharey': False, 'sharex': True})
df
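If a separate matplotlib chart per country is preferred over seaborn facets, the column loop from the question can be kept. The following is a minimal sketch on synthetic data, not the asker's file: the column names country_a and country_b, the date range, and the random values are all made up for illustration. It pivots each column into a week-by-year table and draws one line per year.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to get plot windows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data (assumption):
# weekly DatetimeIndex, one column per country.
rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-07", periods=104, freq="W-MON")
wide = pd.DataFrame(
    rng.integers(0, 100, size=(len(idx), 2)),
    index=idx,
    columns=["country_a", "country_b"],
)

# ISO year and week for every row, as plain integers
iso = wide.index.isocalendar()
year = iso["year"].astype(int)
week = iso["week"].astype(int)

# One week-by-year table per country: rows are weeks, columns are years
pivots = {}
for country in wide.columns:
    pivots[country] = wide[country].groupby([week, year]).sum().unstack("year")

# One subplot per country, one line per year
fig, axes = plt.subplots(nrows=len(pivots), figsize=(10, 4 * len(pivots)),
                         sharex=True, squeeze=False)
for ax, (country, table) in zip(axes.ravel(), pivots.items()):
    table.plot(ax=ax, linewidth=2)
    ax.set_title(country)
    ax.set_xlabel("ISO week")
    ax.set_ylabel("weekly sum")
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure. The grouping happens once per column, so no melt is needed in this variant.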

Related

Problem with SARIMAX plotting in matplotlib [duplicate]

I'm using matplotlib to display a stock's price movements over time. I want to focus on the last 90 days and then predict the next 14 days. I have the last 90 days of data and my predictions, but I want to graph my predictions in a different color, so it's clear they're different.
How would I do this?
If I just add a second plot() call to my code, the predictions will start from the same point as my 90 days of data and be overlaid, which isn't what I want.
Right now I'm doing this:
df[-90:]["price"].plot()
plt.show()
Thanks!

Hopefully this is what you want:
import pandas as pd
import numpy as np; np.random.seed(1)
import matplotlib.pyplot as plt
datelist = pd.date_range('2018-01-01', periods=104)
df = pd.DataFrame(np.cumsum(np.random.randn(104)),
columns=['price'], index=datelist)
plt.plot(df[:90].index, df[:90].values)
plt.plot(df[90:].index, df[90:].values)
# If you don't like the break in the graph, change 90 to 89 in the above line
plt.gcf().autofmt_xdate()
plt.show()

import matplotlib.pyplot as plt
import numpy as np
last90days = np.random.rand(90)
next14days = np.random.rand(14)
plt.plot(np.arange(90), last90days)
plt.plot(np.arange(90, 90+14), next14days)
plt.show()

Short answer:
Use pd.merge() and make good use of missing values in two different series to get two lines with different colors. This suggestion is very flexible with regard to the type of dataframe index you're using (dates, integers or strings). This is what you'll get:
Long answer:
About the detail regarding...
I want to focus on the last 90 days and then predict the next 14 days.
... I'm going to assume that you are using a dataframe with a daily index. I'm also assuming that you know the index values of your dataset with 90 days and your dataset with 14 days.
Here's a dataframe with 104 observations (random data):
Snippet 1:
import pandas as pd
import numpy as np
np.random.seed(12)
rows = 104
df = pd.DataFrame(np.random.randint(-4,5,size=(rows, 1)), columns=['data'])
datelist = pd.date_range('2018-01-01', periods=rows).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
df = df.cumsum()
df.plot()
Plot 1:
To replicate your setup, I've split the dataframe into two frames, one with 90 observations (price) and one with 14 observations (predictions). This way you'll have two different datasets, but the associated index will be continuous, which I assume is your actual situation.
Snippet 2:
df_90 = df[:90].copy(deep = True)
df_14 = df[-14:].copy(deep = True)
df_90.columns = ['price']
df_14.columns = ['predictions']
df_90.plot()
df_14.plot()
Plot 2:
Now you can merge them together so that you get a dataframe with two columns (price and predictions). Of course you'll end up with some missing data, but that is exactly what gives you two lines with different colors when you plot it.
Snippet 3:
df_all = pd.merge(df_90, df_14, how = 'outer', left_index=True, right_index=True)
df_all.plot()
Plot 3:
I hope the suggested solution matches your real situation. Let me know if the details about the index will be an issue, and I'll take a look at that as well.
Here's the complete code for an easy copy-paste:
import pandas as pd
import numpy as np
np.random.seed(12)
rows = 104
df = pd.DataFrame(np.random.randint(-4,5,size=(rows, 1)), columns=['data'])
datelist = pd.date_range('2018-01-01', periods=rows).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
df = df.cumsum()
df.plot()
df_90 = df[:90].copy(deep = True)
df_14 = df[-14:].copy(deep = True)
df_90.columns = ['price']
df_14.columns = ['predictions']
df_90.plot()
df_14.plot()
df_all = pd.merge(df_90, df_14, how = 'outer', left_index=True, right_index=True)
df_all.plot()
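To see why the outer merge yields two separately colored lines, it can help to look at the merged frame itself rather than the plot. This is a small sketch with made-up values (a simple ramp instead of prices); only the shapes and the NaN pattern matter.

```python
import numpy as np
import pandas as pd

# Made-up daily series with 104 observations (assumption, not real prices)
idx = pd.date_range("2018-01-01", periods=104, freq="D")
values = pd.Series(np.arange(104, dtype=float), index=idx)

df_90 = values.iloc[:90].to_frame("price")         # history
df_14 = values.iloc[-14:].to_frame("predictions")  # forecast horizon

# The outer merge on the index keeps every date from both frames
df_all = pd.merge(df_90, df_14, how="outer", left_index=True, right_index=True)

# Around the boundary, each column is NaN where the other has data
print(df_all.iloc[88:92])
print(df_all.isna().sum())
```

Each column is drawn as its own line and the NaN stretches are skipped, so the history and the forecast appear as two lines on a shared date axis.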

How to create a yearly bar plot grouped by months

I'm having a difficult time trying to create a bar plot with a DataFrame grouped by year and month. With the following code I'm trying to plot the data in the figure I created, but instead it returns a second figure. I also tried to move the legend to the right and change its values to the corresponding month names.
I started to get a feel for the DataFrames obtained with the groupby command, though not getting what I expected led me to ask you guys.
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
df = pd.read_csv('fcc-forum-pageviews.csv', index_col='date')
line_plot = df.value[(df.value > df.value.quantile(0.025)) & (df.value < df.value.quantile(0.975))]
fig, ax = plt.subplots(figsize=(10,10))
bar_plot = line_plot.groupby([line_plot.index.year, line_plot.index.month]).mean().unstack()
bar_plot.plot(kind='bar')
ax.set_xlabel('Years')
ax.set_ylabel('Average Page Views')
plt.show()
This is the format of the data that I am analyzing.
date,value
2016-05-09,1201
2016-05-10,2329
2016-05-11,1716
2016-05-12,10539
2016-05-13,6933

Add a sorted categorical 'month' column with pd.Categorical
Transform the dataframe to a wide format with pd.pivot_table where aggfunc='mean' is the default.
Wide format is typically best for plotting grouped bars.
pandas.DataFrame.plot returns matplotlib.axes.Axes, so there's no need to use fig, ax = plt.subplots(figsize=(10,10)).
The pandas .dt accessor is used to extract various components of 'date', which must be a datetime dtype
If 'date' is not a datetime dtype, then transform it with df.date = pd.to_datetime(df.date).
Tested with python 3.8.11, pandas 1.3.1, and matplotlib 3.4.2
Imports and Test Data
import pandas as pd
from calendar import month_name # conveniently supplies a list of sorted month names or you can type them out manually
import numpy as np # for test data
# test data and dataframe
np.random.seed(365)
rows = 365 * 3
data = {'date': pd.bdate_range('2021-01-01', freq='D', periods=rows), 'value': np.random.randint(100, 1001, size=(rows))}
df = pd.DataFrame(data)
# select data within specified quantiles
df = df[df.value.gt(df.value.quantile(0.025)) & df.value.lt(df.value.quantile(0.975))]
# display(df.head())
date value
0 2021-01-01 694
1 2021-01-02 792
2 2021-01-03 901
3 2021-01-04 959
4 2021-01-05 528
Transform and Plot
# create the month column
months = month_name[1:]
df['months'] = pd.Categorical(df.date.dt.strftime('%B'), categories=months, ordered=True)
# if 'date' has been set to the index, as stated in the comments, use this line instead:
# df['months'] = pd.Categorical(df.index.strftime('%B'), categories=months, ordered=True)
# pivot the dataframe into the correct shape
dfp = pd.pivot_table(data=df, index=df.date.dt.year, columns='months', values='value')
# display(dfp.head())
months January February March April May June July August September October November December
date
2021 637.9 595.7 569.8 508.3 589.4 557.7 508.2 545.7 560.3 526.2 577.1 546.8
2022 567.9 521.5 625.5 469.8 582.6 627.3 630.4 474.0 544.1 609.6 526.6 572.1
2023 521.1 548.5 484.0 528.2 473.3 547.7 525.3 522.4 424.7 561.3 513.9 602.3
# plot
ax = dfp.plot(kind='bar', figsize=(12, 4), ylabel='Mean Page Views', xlabel='Year', rot=0)
_ = ax.legend(bbox_to_anchor=(1, 1.02), loc='upper left')
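The effect of the ordered categorical is easiest to see on a tiny frame. The sketch below uses four made-up rows and two hypothetical helper columns, month_str and month_cat, to compare the column order that pivot_table produces with plain strings and with pd.Categorical.

```python
from calendar import month_name

import pandas as pd

# Four made-up observations (assumption, for illustration only)
dates = pd.to_datetime(["2021-01-15", "2021-02-15", "2021-04-15", "2022-01-15"])
df = pd.DataFrame({"date": dates, "value": [10, 20, 30, 40]})

# Plain strings: pivoted columns come out in alphabetical order
df["month_str"] = df["date"].dt.strftime("%B")
alpha = pd.pivot_table(df, index=df["date"].dt.year,
                       columns="month_str", values="value")

# Ordered categorical: pivoted columns follow calendar order
df["month_cat"] = pd.Categorical(df["month_str"],
                                 categories=list(month_name)[1:], ordered=True)
cal = pd.pivot_table(df, index=df["date"].dt.year,
                     columns="month_cat", values="value", observed=True)

print(list(alpha.columns))
print(list(cal.columns))
```

Passing observed=True keeps only the months that actually occur in the data. The bar order in a grouped bar plot follows the column order, which is why the categorical matters here.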

Just pass the ax you defined to pandas:
bar_plot.plot(ax = ax, kind='bar')
If you also want to replace months numbers with names, you have to get those labels, replace numbers with names and re-define the legend by passing to it the new labels:
handles, labels = ax.get_legend_handles_labels()
new_labels = [datetime.date(1900, int(monthinteger), 1).strftime('%B') for monthinteger in labels]
ax.legend(handles = handles, labels = new_labels, loc = 'upper left', bbox_to_anchor = (1.02, 1))
Complete Code
import pandas as pd
from matplotlib import pyplot as plt
import datetime
df = pd.read_csv('fcc-forum-pageviews.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
line_plot = df.value[(df.value > df.value.quantile(0.025)) & (df.value < df.value.quantile(0.975))]
fig, ax = plt.subplots(figsize=(10,10))
bar_plot = line_plot.groupby([line_plot.index.year, line_plot.index.month]).mean().unstack()
bar_plot.plot(ax = ax, kind='bar')
ax.set_xlabel('Years')
ax.set_ylabel('Average Page Views')
handles, labels = ax.get_legend_handles_labels()
new_labels = [datetime.date(1900, int(monthinteger), 1).strftime('%B') for monthinteger in labels]
ax.legend(handles = handles, labels = new_labels, loc = 'upper left', bbox_to_anchor = (1.02, 1))
plt.show()
(plot generated with fake data)
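The month-number-to-name mapping used for the legend can also be written with the standard library's calendar.month_name instead of building a date object. A minimal sketch, where the labels list is a hypothetical stand-in for what ax.get_legend_handles_labels() returns:

```python
from calendar import month_name

# Hypothetical legend labels: month numbers as strings
labels = ["1", "2", "12"]

# month_name[0] is an empty string, so month numbers index it directly
new_labels = [month_name[int(label)] for label in labels]
print(new_labels)
```

Both approaches produce names in the current locale.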

How to make a line plot from a pandas dataframe with a long or wide format

(This is a self-answered post to help others shorten their answers to plotly questions by not having to explain how plotly best handles data of long and wide format)
I'd like to build a plotly figure based on a pandas dataframe in as few lines as possible. I know you can do that using plotly.express, but this fails for what I would call a standard pandas dataframe; an index describing row order, and column names describing the names of a value in a dataframe:
Sample dataframe:
a b c
0 100.000000 100.000000 100.000000
1 98.493705 99.421400 101.651437
2 96.067026 98.992487 102.917373
3 95.200286 98.313601 102.822664
4 96.691675 97.674699 102.378682
An attempt:
fig=px.line(x=df.index, y = df.columns)
This raises an error:
ValueError: All arguments should have the same length. The length of argument y is 3, whereas the length of previous arguments ['x'] is 100

Here you've tried to use a pandas dataframe of a wide format as a source for px.line.
plotly.express is designed to be used with dataframes of a long format, often referred to as tidy data (please take a look at that; no one explains it better than Wickham). Many, particularly those injured by years of battling with Excel, often find it easier to organize data in a wide format. So what's the difference?
Wide format:
data is presented with each different data variable in a separate column
each column has only one data type
missing values are often represented by np.nan
works best with plotly.graph_objects (go)
lines are often added to a figure using fig.add_traces()
colors are normally assigned to each trace
Example:
a b c
0 -1.085631 0.997345 0.282978
1 -2.591925 0.418745 1.934415
2 -5.018605 -0.010167 3.200351
3 -5.885345 -0.689054 3.105642
4 -4.393955 -1.327956 2.661660
5 -4.828307 0.877975 4.848446
6 -3.824253 1.264161 5.585815
7 -2.333521 0.328327 6.761644
8 -3.587401 -0.309424 7.668749
9 -5.016082 -0.449493 6.806994
Long format:
data is presented with one column containing all the values and another column listing the context of the value
missing values are simply not included in the dataset.
works best with plotly.express (px)
colors are set by a default color cycle and are assigned to each unique variable
Example:
id variable value
0 0 a -1.085631
1 1 a -2.591925
2 2 a -5.018605
3 3 a -5.885345
4 4 a -4.393955
... ... ... ...
295 95 c -4.259035
296 96 c -5.333802
297 97 c -6.211415
298 98 c -4.335615
299 99 c -3.515854
How to go from wide to long?
df = pd.melt(df, id_vars='id', value_vars=df.columns[:-1])
The two snippets below will produce the very same plot:
How to use px to plot long data?
fig = px.line(df, x='id', y='value', color='variable')
How to use go to plot wide data?
colors = px.colors.qualitative.Plotly
fig = go.Figure()
fig.add_traces(go.Scatter(x=df['id'], y = df['a'], mode = 'lines', line=dict(color=colors[0])))
fig.add_traces(go.Scatter(x=df['id'], y = df['b'], mode = 'lines', line=dict(color=colors[1])))
fig.add_traces(go.Scatter(x=df['id'], y = df['c'], mode = 'lines', line=dict(color=colors[2])))
fig.show()
By the looks of it, go is more complicated and offers perhaps more flexibility? Well, yes. And no. You can easily build a figure using px and add any go object you'd like!
Complete go snippet:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
# dataframe of a wide format
np.random.seed(123)
X = np.random.randn(100,3)
df=pd.DataFrame(X, columns=['a','b','c'])
df=df.cumsum()
df['id']=df.index
# plotly.graph_objects
colors = px.colors.qualitative.Plotly
fig = go.Figure()
fig.add_traces(go.Scatter(x=df['id'], y = df['a'], mode = 'lines', line=dict(color=colors[0])))
fig.add_traces(go.Scatter(x=df['id'], y = df['b'], mode = 'lines', line=dict(color=colors[1])))
fig.add_traces(go.Scatter(x=df['id'], y = df['c'], mode = 'lines', line=dict(color=colors[2])))
fig.show()
Complete px snippet:
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.offline import iplot
# dataframe of a wide format
np.random.seed(123)
X = np.random.randn(100,3)
df=pd.DataFrame(X, columns=['a','b','c'])
df=df.cumsum()
df['id']=df.index
# dataframe of a long format
df = pd.melt(df, id_vars='id', value_vars=df.columns[:-1])
# plotly express
fig = px.line(df, x='id', y='value', color='variable')
fig.show()
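The wide and long layouts described above hold the same information, which can be checked without any plotting. The sketch below, on a small random frame shaped like the one in the answer, melts wide data to long and pivots it back; the names wide, long and back are only illustrative.

```python
import numpy as np
import pandas as pd

# Small wide frame in the style of the answer's sample data
np.random.seed(123)
wide = pd.DataFrame(np.random.randn(5, 3), columns=["a", "b", "c"]).cumsum()
wide["id"] = wide.index

# Wide -> long: one row per (id, variable) pair
long = wide.melt(id_vars="id", var_name="variable", value_name="value")

# Long -> wide: the inverse reshape
back = long.pivot(index="id", columns="variable", values="value")

print(long.head())
print(back.head())
```

The long frame has 15 rows (5 ids times 3 variables), and pivoting it restores the original three value columns.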

I'm adding this as an answer so that it is more visible.
First of all, thank you @vestland for this. This question comes up over and over, so it's good to have it addressed, and it should make duplicate questions easier to flag.
Plotly Express now accepts wide-form and mixed-form data
as you can check in this post.

You can change the pandas plotting backend to use plotly:
import pandas as pd
pd.options.plotting.backend = "plotly"
Then, to get a fig all you need to write is:
fig = df.plot()
fig.show() displays the above image.

How to select a specific date range in a csv file with pandas (again)?

I looked at the responses to this original question (see here), but they don't seem to solve my issue.
import pandas as pd
import pandas_datareader.data
import datetime
import matplotlib.pyplot as plt
df = pd.read_csv(mypath + filename,
                 skiprows=4, index_col=0,
                 usecols=['Day', 'Cushing OK Crude Oil Future Contract 1 Dollars per Barrel'],
                 skipfooter=0, engine='python')
df.index = pd.to_datetime(df.index)
fig = plt.figure(figsize=plt.figaspect(0.25))
ax = fig.add_subplot(1,1,1)
ax.grid(axis='y',color='lightgrey', linestyle='--', linewidth=0.5)
ax.grid(axis='x',color='lightgrey', linestyle='none', linewidth=0.5)
df['Cushing OK Crude Oil Future Contract 1 Dollars per Barrel'].plot(ax=ax, grid=True,
                                                                      color='blue', fontsize=14, legend=False)
plt.show()
The graph turns out fine but I can't figure out a way to show only a certain date range. I have tried everything.
type(df) = pandas.core.frame.DataFrame
type(df.index) = pandas.core.indexes.datetimes.DatetimeIndex
also, the format for the column 'Day' is YYYY-MM-DD

Assuming you have a datetime index on your dataframe (it looks that way), you can slice using .loc like so:
% matplotlib inline
import pandas as pd
import numpy as np
data = pd.DataFrame({'values': np.random.rand(31)}, index = pd.date_range('2018-01-01', '2018-01-31'))
# Plot the entire dataframe.
data.plot()
# Plot a slice of the dataframe.
data.loc['2018-01-05':'2018-01-10', 'values'].plot(legend = False)
Gives:
The orange series is the slice.
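Label-based slicing on a DatetimeIndex includes both endpoints, which is easy to verify on a toy frame. The sketch below uses made-up integer values and also shows an equivalent boolean mask; by_label, mask and by_mask are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with one row per day in January 2018 (made-up values)
data = pd.DataFrame({"values": np.arange(31)},
                    index=pd.date_range("2018-01-01", "2018-01-31"))

# Label slice: start and end dates are both included
by_label = data.loc["2018-01-05":"2018-01-10"]

# Equivalent boolean mask on the index
mask = (data.index >= "2018-01-05") & (data.index <= "2018-01-10")
by_mask = data[mask]

print(by_label)
```

Either result can be plotted with .plot(ax=ax) in place of the full frame to show only that date range.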

plotting data for different days on a single HH:MM:SS axis

The DataFrame has timestamped data and I want to visually compare the daily temporal evolution of the data. If I group by day and plot the graphs, they are obviously displaced horizontally in time due to differences in their dates.
I want to plot a date agnostic graph of the day wise trends on a time only axis. Towards that end I have resorted to shifting the data back by an appropriate number of days as demonstrated in the following code
import pandas as pd
import datetime
import matplotlib.pyplot as plt
index1 = pd.date_range('20141201', freq='H', periods=2)
index2 = pd.date_range('20141210', freq='2H', periods=4)
index3 = pd.date_range('20141220', freq='3H', periods=5)
index = index1.append([index2, index3])
df = pd.DataFrame(list(range(1, len(index)+1)), index=index, columns=['a'])
gbyday = df.groupby(df.index.day)
first_day = gbyday.keys.min() # convert all data to this day
plt.figure()
ax = plt.gca()
for n,g in gbyday:
    g.shift(-(n-first_day+1), 'D').plot(ax=ax, style='o-', label=str(n))
plt.show()
resulting in the following plot
Question: Is this the pandas way of doing it? In other words how can I achieve this more elegantly?

You can select the hour attribute of the index after grouping like this:
In [36]: fig, ax = plt.subplots()
In [35]: for label, s in gbyday:
....: ax.plot(s.index.hour, s, 'o-', label=label)
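A self-contained version of this idea might look like the sketch below. It rebuilds the question's sample index and plots each day's values against the hour of day. Grouping by day of month is kept from the question, so dates from different months that share a day number would fall into the same group.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to get plot windows
import matplotlib.pyplot as plt
import pandas as pd

# Sample index from the question: three days with different sampling intervals
index1 = pd.date_range("2014-12-01", freq="h", periods=2)
index2 = pd.date_range("2014-12-10", freq="2h", periods=4)
index3 = pd.date_range("2014-12-20", freq="3h", periods=5)
index = index1.append([index2, index3])
df = pd.DataFrame({"a": range(1, len(index) + 1)}, index=index)

fig, ax = plt.subplots()
# One line per day of month, with the hour of day on the x axis
for day, s in df["a"].groupby(df.index.day):
    ax.plot(s.index.hour, s.to_numpy(), "o-", label=str(day))
ax.set_xlabel("hour of day")
ax.legend(title="day of month")
```

Because only the integer hour is used, timestamps that are not on a full hour would be shifted to the start of their hour.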

It might be a little too late for this answer, but in case anyone is still looking for it.
This solution works on different months (it was an issue if using the code from the original question) and keeps fractional hours.
import pandas as pd
import matplotlib.pyplot as plt
index0 = pd.date_range('20141101', freq='H', periods=2)
index1 = pd.date_range('20141201', freq='H', periods=2)
index2 = pd.date_range('20141210', freq='2H', periods=4)
index3 = pd.date_range('20141220', freq='3H', periods=5)
index = index1.append([index2, index3, index0])
df = pd.DataFrame(list(range(1, len(index)+1)), index=index, columns=['a'])
df['time_hours'] = (df.index - df.index.normalize()) / pd.Timedelta(hours=1)
fig, ax = plt.subplots()
for n,g in df.groupby(df.index.normalize()):
    ax.plot(g['time_hours'], g['a'], label=n, marker='o')
ax.legend(loc='best')
plt.show()
