How to extract appropriate data in Plotly Grouped Bar Chart? - python

I am trying to plot a grouped bar chart in Plotly that shows 3 years of data, but the chart does not display the data correctly: all the bars in the graph are equal. Something like this:
Here is my code for plotting:
data=[go.Bar(x=nasdaq['Sector'],y=recent_ipos['IPO Year'],textangle=-45,name='2015'),
go.Bar(x=nasdaq['Sector'],y=recent_ipos['IPO Year'],textangle=-45,name='2016'),
go.Bar(x=nasdaq['Sector'],y=recent_ipos['IPO Year'],textangle=-45,name='2017')
]
layout=go.Layout(title="NASDAQ Market Capitalization IPO yEAR (million USD)",barmode='group')
fig=go.Figure(data=data,layout=layout)
fig.show(renderer="colab")
Here is my code which I am using to extract the data for 3 years:
recent_ipos = nasdaq[nasdaq['IPO Year'] > 2014]
recent_ipos['IPO Year'] = recent_ipos['IPO Year'].astype(int)
I tried to extract the 2015 data here using an array, but I could not find an appropriate method to extract an element from the array:
ipo2015=np.array(recent_ipos['IPO Year'])
ipo2015
I am not sure if this is the right way to extract the particular year data or not??
Things I want to know here are :
How to extract year data appropriately in the graph using Plotly?
What changes I should make to solve this inconsistency in the graph?
What should I put in Y= in all the three groups bars??
How to extract the years dynamically rather than manually?
Hope to receive help from this amazing community on StackOverflow.
Thanks in advance.!!

I wrote the code under the assumption that the data on which the question is based is in data frame format. The sample data is taken from plotly. query() can also reference a Python variable by prefixing it with @, as shown in the commented-out line in the code.
import plotly.graph_objects as go
import plotly.express as px
# yyyy = [1992,1997,2002]
df = px.data.gapminder()
continent = df['continent'].unique().tolist()
yyyy = df['year'].unique().tolist()[-3:] # update
data = []
for y in yyyy:
    # tmp_df = df.query('year == @y')
    tmp_df = df[df['year'] == y].groupby('continent')['pop'].sum()
    data.append(go.Bar(x=tmp_df.index, y=tmp_df, name=str(y)))
# Change the bar mode
fig = go.Figure(data)
fig.update_layout(barmode='group')
fig.show()
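Applied to the shape of data in the question, the same idea might look like the sketch below. It assumes the quantity to plot is the number of IPOs per sector and year; the small `nasdaq` frame here is a hypothetical stand-in with invented values, and only the column names `Sector` and `IPO Year` come from the question. If market capitalization is what should be plotted, the `.size()` call would be replaced by a sum over that column.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `nasdaq` frame (values invented).
nasdaq = pd.DataFrame({
    "Sector": ["Tech", "Tech", "Health", "Finance", "Health", "Tech", "Finance"],
    "IPO Year": [2015, 2016, 2015, 2017, 2017, 2017, 2013],
})

recent_ipos = nasdaq[nasdaq["IPO Year"] > 2014]

# Rows: sector, columns: IPO year, cells: number of IPOs in that sector and year.
counts = (
    recent_ipos.groupby(["Sector", "IPO Year"])
    .size()
    .unstack(fill_value=0)
)

# The years are discovered from the data instead of being typed by hand.
years = sorted(counts.columns)
for year in years:
    print(year, counts[year].to_dict())
```

Each column of `counts` is then a candidate for one trace, for example `go.Bar(x=counts.index, y=counts[year], name=str(year))`, which gives every year its own y values instead of the same column three times.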

Related

python plotly (px) animation frame date is in wrong order

With plotly express I've built a bar chart similar to as shown on their website.
As px.bar did not allow me to run the animation frame on datetime64[ns] I transformed the datetime into a string as follows.
eu_vaccine_df['date_str'] = eu_vaccine_df['date'].apply(lambda x: str(x))
eu_vaccine_df[['date_str', 'date', 'country', 'people_vaccinated_per_hundred']].head()
The dataset on which I then run the px.bar looks as follows and contains 30 different countries.
The code for my barchart including animation looks as follows.
fig = px.bar(
eu_vaccine_df,
x='country', y='people_vaccinated_per_hundred',
color='country',
animation_frame='date_str',
animation_group='country',
hover_name='country',
range_y=[0,50],
range_x=[0,30]
)
fig.update_layout(
template='plotly_dark',
margin=dict(r=10, t=25, b=40, l=60)
)
fig.show()
In the end result the date on the animation frame is wrong. It first shows all results from 2021 and then all results from 2020 as shown at the bottom of the following screenshot.
Sorting my df by the date solved the issue.
covid_df['date'] = pd.to_datetime(covid_df['date'])
covid_df = covid_df.sort_values('date', ascending=True)
covid_df['date'] = covid_df['date'].dt.strftime('%m-%d-%Y')
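The reason sorting helps is, as far as I can tell, that Plotly Express takes the animation frames in the order the values first appear in the dataframe. A minimal pandas-only sketch with invented rows shows the frame order that results from sorting before converting to strings; the column names `date_str` and `value` are placeholders.

```python
import pandas as pd

# Invented rows that arrive out of chronological order.
df = pd.DataFrame({
    "date": ["2021-01-02", "2020-12-30", "2021-01-01", "2020-12-31"],
    "country": ["A", "A", "A", "A"],
    "value": [3, 0, 2, 1],
})

# Sort on real datetimes first, then build the string column for animation_frame.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date", ascending=True)
df["date_str"] = df["date"].dt.strftime("%Y-%m-%d")

# Order in which the frame labels first appear.
frame_order = df["date_str"].unique().tolist()
print(frame_order)
```

Using a year-first format such as `%Y-%m-%d` has the side benefit that the strings also sort chronologically if they are ever sorted as text.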

Stacked Area Chart in Python

I'm working on an assignment from school, and have run into a snag when it comes to my stacked area chart.
The data is fairly simple: 4 columns that look similar to this:
Series id    Year    Period    Value
LNS140000    1948    M01       3.4
I'm trying to create a stacked area chart using Year as my x and Value as my y and breaking it up over Period.
#Stacked area chart still using unemployment data
x = d.Year
y = d.Value
plt.stackplot(x, y, labels = d['Period'])
plt.legend(d['Period'], loc = 'upper left')
plt.show()
However, when I do it like this it only picks up M01 and there are M01-M12. Any thoughts on how I can make this work?
You need to preprocess your data a little before passing them to the stackplot function. I took a look at this link to work on an example that could be suitable for your case.
Since I've seen one row of your data, I add some random values to the dataset.
import pandas as pd
import matplotlib.pyplot as plt
dd=[[1948,'M01',3.4],[1948,'M02',2.5],[1948,'M03',1.6],
[1949,'M01',4.3],[1949,'M02',6.7],[1949,'M03',7.8]]
d=pd.DataFrame(dd,columns=['Year','Period','Value'])
years=d.Year.unique()
periods=d.Period.unique()
#Now group them per period, but in year sequence
d.sort_values(by='Year',inplace=True) # to ensure entire dataset is ordered
pds=[]
for p in periods:
    pds.append(d[d.Period==p]['Value'].values)
plt.stackplot(years,pds,labels=periods)
plt.legend(loc='upper left')
plt.show()
Is that what you want?
So I was able to use Seaborn to help out. First I did a pivot table
df = d.pivot(index = 'Year',
columns = 'Period',
values = 'Value')
df
Then I set up seaborn
plt.style.use('seaborn')
sns.set_style("white")
sns.set_theme(style = "ticks")
df.plot.area(figsize = (20,9))
plt.title("Unemployment by Year and Month\n", fontsize = 22, loc = 'left')
plt.ylabel("Values", fontsize = 22)
plt.xlabel("Year", fontsize = 22)
It seems to me that the problem you are having relates to the formatting of the data. Look at how the values are formatted in this matplotlib example. I would try grouping the data by period, or pivoting it into the correct format, and then plotting again.
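The pivot and the stackplot call can be combined into one short sketch. It reuses the invented sample values from the answer above (not the real unemployment data) and selects the off-screen Agg backend only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Same invented sample rows as in the answer above.
d = pd.DataFrame(
    [[1948, "M01", 3.4], [1948, "M02", 2.5], [1948, "M03", 1.6],
     [1949, "M01", 4.3], [1949, "M02", 6.7], [1949, "M03", 7.8]],
    columns=["Year", "Period", "Value"],
)

# One row per year, one column per period.
wide = d.pivot(index="Year", columns="Period", values="Value")

# stackplot expects one row of y values per layer, so pass the transpose.
fig, ax = plt.subplots()
layers = ax.stackplot(wide.index, wide.T.to_numpy(), labels=wide.columns)
ax.legend(loc="upper left")
```

In an interactive session, `plt.show()` after the last line displays the chart; with the real data there would be twelve layers, M01 through M12.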

How to make a line plot from a dataframe with multiple categorical columns in matplotlib

I want to make line charts for different categories, where one category is the country and the other is the year, plotted on a weekly basis. Initially, I was able to draft line plots using seaborn, but it is not very handy for things like setting labels, the legend, the color palette and so on. I am wondering whether there is an easy way to reshape this data with multiple categorical variables and render line charts. In my initial attempt I tried seaborn.relplot, but its parameters are not easy to tune and the resulting plot is hard to customize. Can anyone point me to an efficient way to reshape a dataframe with multiple categorical columns and render a clear line chart? Any thoughts?
reproducible data & my attempt:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/cb0553e009933574ac7ec3109ffb5140/raw/a277bc00dc08e526a7d5b7ead5425905f7206bfa/export.csv'
dff = pd.read_csv(url, parse_dates=['weekly'])
dff.drop('Unnamed: 0', axis=1, inplace=True)
df2_bf = dff.groupby(['destination', 'weekly'])['FCF_Beef'].sum().unstack()
df2_bf = df2_bf.fillna(0)
mm = df2_bf.T
mm.columns.name = None
mm = mm[~(mm.isna().sum(1)/mm.shape[1]).gt(0.9)].fillna(0)
#Total sum per column:
mm.loc['Total',:]= mm.sum(axis=0)
mm1 = mm.T
mm1 = mm1.nlargest(6, columns=['Total'])
mm1.drop('Total', axis=1, inplace=True)
mm2 = mm1.T
mm2.reset_index(inplace=True)
mm2['weekly'] = pd.to_datetime(mm2['weekly'])
mm2['year'] = mm2['weekly'].dt.year
mm2['week'] = mm2['weekly'].dt.isocalendar().week
df = mm2.melt(id_vars=['weekly','week','year'], var_name='country')
df_ = df.groupby(['country', 'year', 'week'], as_index=False)['value'].sum()
sns.relplot(data=df_, x='week', y='value', hue='year', row='country', kind='line', height=6, aspect=2, facet_kws={'sharey': False, 'sharex': False}, sizes=(20, 10))
current plot
This is one of the current plots that I made with seaborn.relplot.
The structure of the plot is okay for me, but seaborn.relplot is hard to tune and is not as flexible as using matplotlib directly. Also, I realized that the way I aggregate my data is not very efficient. I think there might be a shortcut to make the above code snippet more efficient, like:
plt_data = []
for i in dff.loc[:, ['FCF_Beef','FCF_Beef']]:
...
but doing this way I faced a couple of issues to make the right plot. Can anyone point me out how to make this simple and efficient in order to make the expected line chart with matplotlib? Does anyone know any better way of doing this? Any idea? Thanks
desired output
In my desired plot, first I need to iterate list of countries, where each country has one subplot, in each subplot, x-axis shows 52 weeks and y-axis shows weeklyExport amount of different years for each country. Here is draft plot that I made with seaborn.relplot.
note that, I don't like the output from seaborn.relplot, so I am wondering how can I make above attempt more efficient with matplotlib attempt. Any idea?
As requested by the OP, following is an iterative way to plot the data.
The following example plots each year, for a given 'destination' in a single figure
This is similar to the answer for this question.
import pandas as pd
import matplotlib.pyplot as plt
# load the data
url = 'https://gist.githubusercontent.com/adamFlyn/cb0553e009933574ac7ec3109ffb5140/raw/a277bc00dc08e526a7d5b7ead5425905f7206bfa/export.csv'
df = pd.read_csv(url, parse_dates=['weekly'], usecols=range(1, 6))
# groupby destination and iterate through for plotting
for g, d in df.groupby('destination'):
    # create the figure
    fig, ax = plt.subplots(figsize=(7, 4))
    # add lines for specific years
    for year in d.weekly.dt.year.unique():
        data = d[d.weekly.dt.year == year].copy() # select the data from d, by year
        data['week'] = data.weekly.dt.isocalendar().week # create a week column
        data.sort_values('weekly', inplace=True)
        display(data.head()) # display is for jupyter, if it causes an error, use print
        data.plot(x='week', y='FCF_Beef', ax=ax, label=year)
    plt.show()
Single sample plot
If we look at the tail of one of the dataframes, data.weekly.dt.isocalendar().week is putting the last day of the year in week 1, so a line is drawn back to the last data point, which is placed at week 1.
This function rests on datetime.datetime(2018, 12, 31).isocalendar() and is the expected behavior from the datetime module, as per this closed pandas bug.
Removing the last row with .iloc[:-1, :] is a workaround.
Alternatively, replace data['week'] = data.weekly.dt.isocalendar().week with data['week'] = data.weekly.dt.strftime('%W').astype('int')
data.iloc[:-1, :].plot(x='week', y='FCF_Beef', ax=ax, label=year)
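The difference between the two week numbers can be checked with the standard library alone. The sketch below uses the same date mentioned above, 2018-12-31, which is a Monday.

```python
from datetime import datetime

last_day = datetime(2018, 12, 31)

# ISO 8601 assigns this Monday to week 1 of ISO year 2019.
iso_year, iso_week, iso_weekday = tuple(last_day.isocalendar())

# %W counts weeks within the calendar year, with Monday as the first day.
monday_week = int(last_day.strftime("%W"))

print(iso_year, iso_week, monday_week)
```

Because `%W` never rolls over into the next year, the last data point stays at the right-hand end of the x-axis instead of jumping back to week 1.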
Updated with all code from OP
# load the data
url = 'https://gist.githubusercontent.com/adamFlyn/cb0553e009933574ac7ec3109ffb5140/raw/a277bc00dc08e526a7d5b7ead5425905f7206bfa/export.csv'
dff = pd.read_csv(url, parse_dates=['weekly'], usecols=range(1, 6))
df2_bf = dff.groupby(['destination', 'weekly'])['FCF_Beef'].sum().unstack()
df2_bf = df2_bf.fillna(0)
mm = df2_bf.T
mm.columns.name = None
mm = mm[~(mm.isna().sum(1)/mm.shape[1]).gt(0.9)].fillna(0)
#Total sum per column:
mm.loc['Total',:]= mm.sum(axis=0)
mm1 = mm.T
mm1 = mm1.nlargest(6, columns=['Total'])
mm1.drop('Total', axis=1, inplace=True)
mm2 = mm1.T
mm2.reset_index(inplace=True)
mm2['weekly'] = pd.to_datetime(mm2['weekly'])
mm2['year'] = mm2['weekly'].dt.year
mm2['week'] = mm2['weekly'].dt.strftime('%W').astype('int')
df = mm2.melt(id_vars=['weekly','week','year'], var_name='country')
# groupby destination and iterate through for plotting
for g, d in df.groupby('country'):
    # create the figure
    fig, ax = plt.subplots(figsize=(7, 4))
    # add lines for specific years
    for year in d.weekly.dt.year.unique():
        data = d[d.weekly.dt.year == year].copy() # select the data from d, by year
        data.sort_values('weekly', inplace=True)
        display(data.head()) # display is for jupyter, if it causes an error, use print
        data.plot(x='week', y='value', ax=ax, label=year, title=g)
    plt.show()
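As a possible shortcut for the reshaping step, a single pivot_table call can produce one column per (country, year) pair, so each country's block is ready to plot with one line per year. This is only a sketch on a tiny invented frame; the column names `destination`, `weekly` and `FCF_Beef` come from the question, and the values are made up.

```python
import pandas as pd

# Tiny invented sample with the same column names as the question's data.
df = pd.DataFrame({
    "destination": ["CA", "CA", "CA", "CA", "US", "US"],
    "weekly": pd.to_datetime([
        "2018-01-08", "2018-01-15", "2019-01-07",
        "2019-01-14", "2018-01-08", "2019-01-07",
    ]),
    "FCF_Beef": [10.0, 12.0, 11.0, 13.0, 20.0, 21.0],
})

df["year"] = df["weekly"].dt.year
df["week"] = df["weekly"].dt.strftime("%W").astype(int)

# Rows: week of the year; columns: (destination, year); cells: summed exports.
wide = df.pivot_table(
    index="week", columns=["destination", "year"],
    values="FCF_Beef", aggfunc="sum",
)

# Selecting one destination leaves a frame with one column per year.
ca = wide["CA"]
print(ca)
```

With matplotlib, `wide[country].plot(ax=ax, title=country)` inside a loop over the countries would then draw one line per year on each subplot.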

Plotly x-axis dates are displayed as the date plus one day

I'm using plotly to create a stacked bar chart, with each bar representing a quarter end date. The data is pulled into a dataframe via SQL and the dates are parsed in the read_sql statement.
When graphing, the dates on the x-axis are displayed as 10/01 instead of 9/30, 4/1 instead of 3/31, etc.
Any idea how I can just display the dates correctly?
Here's a sample
import plotly.express as px
fig = px.bar(df.groupby('dt_quarter').head(10), x='dt_quarter', y="amount", color="name", title="Stack Bar Test")
fig.update_layout(yaxis_title_text = 'Amount ($)',xaxis_title_text='Date', legend_title_text='Sector', legend_traceorder='reversed')
fig.show()
What I ended up doing was creating a new column in my dataframe that displays the date in 'YYYYQX' format (e.g. 2020Q4, etc.). I then used that as my x axis and it works fine.
For creating the new column:
alldata['quarter'] = pd.PeriodIndex(alldata.dt_quarter, freq='Q').astype('str')
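A quick check of what this conversion yields, using a few invented quarter-end dates. The second variant, based on `strftime` with the `%q` directive for periods, is an alternative that is not part of the original answer and is shown only in case a quarter-first label is preferred.

```python
import pandas as pd

# Invented quarter-end dates standing in for the dt_quarter column.
dates = pd.to_datetime(["2020-03-31", "2020-09-30", "2020-12-31"])

quarters = pd.PeriodIndex(dates, freq="Q")

# Default string form of a quarterly period: year first, then the quarter.
labels = quarters.astype(str)

# Alternative (not from the original answer): quarter first, then the year.
custom = quarters.strftime("Q%q%Y")

print(list(labels), list(custom))
```

Since the labels are plain strings, Plotly treats the axis as categorical, so no date arithmetic is applied to the tick positions.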

Plotly: How to style a plotly figure so that it doesn't display gaps for missing dates?

I have a plotly graph of the EUR/JPY exchange rate across a few months in 15 minute time intervals, so as a result, there is no data from friday evenings to sunday evenings.
Here is a portion of the data, note the skip in the index (type: DatetimeIndex) over the weekend:
Plotting this data in plotly results in a gap over the missing dates. Using the dataframe above:
import plotly.graph_objs as go
candlesticks = go.Candlestick(x=data.index, open=data['Open'], high=data['High'],
low=data['Low'], close=data['Close'])
fig = go.Figure(layout=cf_layout)
fig.add_trace(trace=candlesticks)
fig.show()
Output:
As you can see, there are gaps where the missing dates are. One solution I've found online is to change the index to text using:
data.index = data.index.strftime("%d-%m-%Y %H:%M:%S")
and plotting it again, which admittedly does work, but has its own problem. The x-axis labels look atrocious:
I would like to produce a graph that plots a graph like in the second plot where there are no gaps, but the x-axis is displayed like as it is on the first graph. Or at least displayed in a much more concise and responsive format, as close to the first graph as possible.
Thank you in advance for any help!
Even if some dates are missing in your dataset, plotly interprets your dates as date values, and shows even missing dates on your timeline. One solution is to grab the first and last dates, build a complete timeline, find out which dates are missing in your original dataset, and include those dates in:
fig.update_xaxes(rangebreaks=[dict(values=dt_breaks)])
This will turn this figure:
Into this:
Complete code:
import plotly.graph_objects as go
from datetime import datetime
import pandas as pd
import numpy as np
# sample data
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
# remove some dates to build a similar case as in the question
df = df.drop(df.index[75:110])
df = df.drop(df.index[210:250])
df = df.drop(df.index[460:480])
# build complete timeline from start date to end date
dt_all = pd.date_range(start=df['Date'].iloc[0],end=df['Date'].iloc[-1])
# retrieve the dates that ARE in the original dataset
dt_obs = [d.strftime("%Y-%m-%d") for d in pd.to_datetime(df['Date'])]
# define dates with missing values
dt_breaks = [d for d in dt_all.strftime("%Y-%m-%d").tolist() if d not in dt_obs]
# make figure
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
open=df['AAPL.Open'], high=df['AAPL.High'],
low=df['AAPL.Low'], close=df['AAPL.Close'])
])
# hide dates with no values
fig.update_xaxes(rangebreaks=[dict(values=dt_breaks)])
fig.update_layout(yaxis_title='AAPL Stock')
fig.show()
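The list of missing dates can also be computed with an index set difference instead of a list comprehension. The sketch below is pandas-only and uses three invented trading days; it is a variation on the code above, not the original author's version.

```python
import pandas as pd

# Invented observed dates with a two-day hole in the middle.
observed = pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-08"])

# Every calendar day between the first and last observation.
all_days = pd.date_range(observed.min(), observed.max(), freq="D")

# Days on the full timeline that never occur in the data.
dt_breaks = all_days.difference(observed).strftime("%Y-%m-%d").tolist()

print(dt_breaks)
```

The resulting list is passed to the figure in the same way, via `fig.update_xaxes(rangebreaks=[dict(values=dt_breaks)])`. On long series the set difference avoids scanning the observed list once per candidate date.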
In case someone here wants to remove the gaps for weekends and for hours outside trading hours:
using rangebreaks, as shown below, is the way to do it.
fig = go.Figure(data=[go.Candlestick(x=df['date'], open=df['Open'], high=df['High'], low=df['Low'], close=df['Close'])])
fig.update_xaxes(
    rangeslider_visible=True,
    rangebreaks=[
        # NOTE: Below values are bound (not single values), ie. hide x to y
        dict(bounds=["sat", "mon"]), # hide weekends, eg. hide sat to before mon
        dict(bounds=[16, 9.5], pattern="hour"), # hide hours outside of 9.30am-4pm
        # dict(values=["2020-12-25", "2021-01-01"]) # hide holidays (Christmas and New Year's, etc)
    ]
)
fig.update_layout(
    title='Stock Analysis',
    yaxis_title=f'{symbol} Stock'
)
fig.show()
here's Plotly's doc.
Thanks for the amazing sample! It works on daily data, but with intraday / 5-minute data the rangebreaks leave only one day on the chart:
# build complete timeline
dt_all = pd.date_range(start=df.index[0],end=df.index[-1], freq="5min") # "5min" is the current spelling of the older "5T" alias
# retrieve the dates that ARE in the original dataset
dt_obs = [d.strftime("%Y-%m-%d %H:%M:%S") for d in pd.to_datetime(df.index, format="%Y-%m-%d %H:%M:%S")]
# define dates with missing values
dt_breaks = [d for d in dt_all.strftime("%Y-%m-%d %H:%M:%S").tolist() if d not in dt_obs]
To fix the problem with intraday data, you can use the dvalue parameter of rangebreaks with the right ms value.
For example, 1 hour = 3.6e6 ms, so use dvalue with this value.
Documentation here : https://plotly.com/python/reference/layout/xaxis/
fig.update_xaxes(rangebreaks=[dict(values=dt_breaks, dvalue=3.6e6)])
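For the 5-minute case in the comment above, the same reasoning would give a dvalue equal to the bar spacing in milliseconds. A pandas-only sketch with three invented timestamps, assuming evenly spaced 5-minute bars:

```python
import pandas as pd

# Invented 5-minute bars with two missing slots (09:40 and 09:45).
idx = pd.to_datetime([
    "2021-01-04 09:30", "2021-01-04 09:35", "2021-01-04 09:50",
])

step = pd.Timedelta(minutes=5)  # assumed bar spacing

# Full 5-minute grid between the first and last bar.
all_slots = pd.date_range(idx[0], idx[-1], freq=step)

# Slots on the grid that have no bar.
dt_breaks = all_slots.difference(idx).strftime("%Y-%m-%d %H:%M:%S").tolist()

# Width of each break in milliseconds, matching the bar spacing.
dvalue_ms = int(step.total_seconds() * 1000)

print(dt_breaks, dvalue_ms)
```

These two values would then go into `fig.update_xaxes(rangebreaks=[dict(values=dt_breaks, dvalue=dvalue_ms)])`. Without dvalue, each entry in values hides a full day by default, which matches the symptom of only one day remaining on the chart.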
