I am working on a timeseries plot from data that looks like the following:
import pandas as pd
data = {'index': [1, 34, 78, 900, 1200, 5000, 9001, 12000, 15234, 23432],
'rating': [90, 85, 89, 82, 78, 65, 54, 32, 39, 45],
'Year': [2005, 2005, 2005, 2006, 2006, 2006, 2007, 2008, 2009, 2009]}
df = pd.DataFrame(data)
The main issue is the lack of actual dates. The data is sorted in ascending index order, but the index values themselves are meaningless.
I have plotted the data using
import plotly.express as px
fig = px.line(df, x='index', y='rating')
fig.show()
but would like to shade or label each year on the plot (this could just be vertical dotted lines separating the years, or alternating grey shades per year beneath the line but above the axis).
I am assuming that you have already sorted the DataFrame by the index column.
Here's a solution using a bar (column) chart in matplotlib.
import matplotlib.pyplot as plt
import numpy as np
# [optional] create a dictionary of colors with year as keys. It is better if this is dynamically generated if you have a lot of years.
color_cycle = {'2005': 'red', '2006': 'blue', '2007': 'green', '2008': 'orange', '2009': 'purple'}
# I am assuming that the rating data is sorted by index already
# plot rating as a column chart using equal spacing on the x-axis
plt.bar(x=np.arange(len(df)), height=df['rating'], width=0.8, color=[color_cycle[str(year)] for year in df['Year']])
# add Year as x-axis labels
plt.xticks(np.arange(len(df)), df['Year'])
# add labels to the axes
plt.xlabel('Year')
plt.ylabel('Rating')
# display the plot
plt.show()
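If there are many years, the color dictionary can be built from the data instead of typed by hand. A minimal sketch, assuming matplotlib's built-in tab10 colormap is acceptable (that choice, and keying the dictionary by the integer year rather than a string, are assumptions made here, not part of the answer above):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'index': [1, 34, 78, 900, 1200, 5000, 9001, 12000, 15234, 23432],
    'rating': [90, 85, 89, 82, 78, 65, 54, 32, 39, 45],
    'Year': [2005, 2005, 2005, 2006, 2006, 2006, 2007, 2008, 2009, 2009],
})

# one color per distinct year, taken from a qualitative colormap (assumed: tab10)
years = sorted(df['Year'].unique())
cmap = plt.get_cmap('tab10')
color_cycle = {year: cmap(i % cmap.N) for i, year in enumerate(years)}

# same bar chart as above, colored through the generated dictionary
bar_colors = [color_cycle[year] for year in df['Year']]
fig, ax = plt.subplots()
ax.bar(np.arange(len(df)), df['rating'], width=0.8, color=bar_colors)
ax.set_xticks(np.arange(len(df)))
ax.set_xticklabels(df['Year'])
ax.set_xlabel('Year')
ax.set_ylabel('Rating')
```

Call plt.show() afterwards to display the figure. With more than ten distinct years the colors repeat, so a larger colormap would be needed.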
Outputs
My code didn't work. pandas as pd and matplotlib.pyplot as plt were preloaded.
import plotly.express as px
# Set the figure style and initialize a new figure
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(12,8))
# Create a scatter plot of duration versus release_year
a = netflix_movies_col_subset["release_year"]
b = netflix_movies_col_subset["duration"]
c = colors
data1 = {"Release year": a, "Duration": b, "Rating": c}
df1 = pd.DataFrame(data1)
print(df1)
px.scatter(df1, x = "Release year", y = "Duration", color = "Rating")
# Create a title and axis labels
plt.title("Movie duration by year of release")
plt.xlabel("Release year")
plt.ylabel("Duration (min)")
# Show the plot
plt.show()
I am unsure why this didn't work on Anaconda:
# Set the figure style and initialize a new figure
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(12,8))
# Create a scatter plot of duration versus release_year
plt.scatter(netflix_movies_col_subset["release_year"],netflix_movies_col_subset["duration"])
# Create a title and axis labels
plt.title("Movie duration by year of release")
plt.xlabel("Release year")
plt.ylabel("Duration (min)")
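Update the code this way and you should be able to see the chart (minus the dummy data). The original mixes two libraries: px.scatter returns a Plotly figure, so plt.title, plt.xlabel, plt.ylabel and plt.show only act on a separate, empty matplotlib figure. Also, I'm not sure you really need to copy netflix_movies_col_subset into df1.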
import pandas as pd
import plotly.express as px
# The matplotlib setup (plt.style.use, plt.figure) is dropped: it has no effect on a Plotly figure
data = {'release_year': [2020, 2021, 1993, 2021, 2021, 2015, 2007, 2009, 2006, 2015],
'duration':[90, 91, 125, 104, 127, 96, 158, 88, 88, 111],
'Rating': ['black', 'red', 'blue', 'blue', 'black', 'black', 'blue', 'blue', 'black', 'black']}
netflix_movies_col_subset=pd.DataFrame(data)
# Create a scatter plot of duration versus release_year
a = netflix_movies_col_subset["release_year"]
b = netflix_movies_col_subset["duration"]
c = netflix_movies_col_subset["Rating"]
data1 = {"Release year": a, "Duration": b, "Rating": c}
df1 = pd.DataFrame(data1)
print(df1)
# Keep the Plotly figure under its own name instead of rebinding plt (the matplotlib alias)
fig = px.scatter(df1, x="Release year", y="Duration", color="Rating",
                 labels={"Release year": "Release year",  # Not really required
                         "Duration": "Duration (min)"})
fig.update_layout(title_text='Movie duration by year of release', title_x=0.5)
# Show the plot
fig.show()
Output
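If you would rather stay in matplotlib, so that the title and axis-label calls keep working, a sketch of the same chart without Plotly follows. It assumes the Rating column holds valid matplotlib color names, as in the dummy data above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# dummy data copied from the answer above
netflix_movies_col_subset = pd.DataFrame({
    'release_year': [2020, 2021, 1993, 2021, 2021, 2015, 2007, 2009, 2006, 2015],
    'duration': [90, 91, 125, 104, 127, 96, 158, 88, 88, 111],
    'Rating': ['black', 'red', 'blue', 'blue', 'black', 'black', 'blue', 'blue', 'black', 'black'],
})

plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(12, 8))

# c takes one color per point; here the color names come straight from the column
points = ax.scatter(netflix_movies_col_subset['release_year'],
                    netflix_movies_col_subset['duration'],
                    c=netflix_movies_col_subset['Rating'].tolist())

ax.set_title('Movie duration by year of release')
ax.set_xlabel('Release year')
ax.set_ylabel('Duration (min)')
```

Finish with plt.show(); a script run outside a notebook displays nothing without that call.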
I am trying to plot several ranges of my data in the same plot with plotly. For example, the x-axis should cover the ranges [0,14], [520,540] and [850,890]. However, when these data are put into one plot, there are huge empty gaps between the data regions, because the plot simply covers the complete range [0,890]. Due to this scaling, the individual features of the data are compressed to the point where nothing is discernible.
What I want to achieve is something like in this image:
Is it even possible to plot such a discontinuous axis? If anybody knows what this kind of discontinuity is called, I would be interested to hear it.
Thanks to everyone
There's no direct way to do this with plotly. But with a little creativity you can easily make a setup that should come pretty close to what you're asking for. And save you from the pain of static matplotlib approaches. I'll spare you the details in case the following figure is not to your liking. But if it is, then I don't mind explaining the details. You'll find most of the details in the comments in the complete code snippet below though.
Plot 1:
Code 1:
# imports
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
from plotly.subplots import make_subplots
# data
df1 = pd.DataFrame({'years': [1995, 1996, 1997, 1998, 1999, 2000,],
'China': [219, 146, 112, 127, 124, 180],
'Rest of world': [16, 13, 10, 11, 28, 37]}).set_index('years')
df2 = pd.DataFrame({'years': [2008, 2009, 2010, 2011, 2012, 2013,],
'China': [207, 236, 263,350, 430, 474],
'Rest of world': [43, 55, 56, 88, 105, 156]}).set_index('years')
df3 = pd.DataFrame({'years': [2017, 2018, 2019, 2020, 2021, 2022],
'China': [488, 537, 500, 439, 444, 555],
'Rest of world': [299, 340, 403, 549, 300, 311]}).set_index('years')
# df.set_index('years', inplace = True)
# organize dataframes with different x-axes in a dict
dfs = {'df1': df1,
'df2': df2,
'df3': df3}
# subplot setup
colors = px.colors.qualitative.Plotly
fig = make_subplots(rows=1, cols=len(dfs.keys()), horizontal_spacing = 0.02)
fig.update_layout(title = "Broken / discontinued / gapped x-axis")
# Assign columns from dataframes in dict to the correct subplot
for i, dfd in enumerate(dfs, start=1):
    for j, col in enumerate(dfs[dfd].columns):
        fig.add_trace(go.Scatter(x=dfs[dfd].index,
                                 y=dfs[dfd][col],
                                 name=col,
                                 marker_color=colors[j],
                                 legendgroup=col,
                                 showlegend=(i == 1),
                                 ), row=1, col=i)
# this section is made specifically for this dataset
# and this number of dataframes, but can easily
# be made flexible wrt your data if this setup
# is something you can use
fig.update_yaxes(range=[0, 750])
fig.update_yaxes(showticklabels=False, row=1, col=2)
fig.update_yaxes(showticklabels=False, row=1, col=3)
# and just a little aesthetic adjustment
# that's admittedly a bit more challenging
# to automate...
# But entirely possible =D
fig.add_shape(type="line",
x0=0.31, y0=-0.01, x1=0.33, y1=0.01,
line=dict(color="grey",width=1),
xref = 'paper',
yref = 'paper'
)
fig.add_shape(type="line",
x0=0.32, y0=-0.01, x1=0.34, y1=0.01,
line=dict(color="grey",width=1),
xref = 'paper',
yref = 'paper'
)
fig.add_shape(type="line",
x0=0.66, y0=-0.01, x1=0.68, y1=0.01,
line=dict(color="grey",width=1),
xref = 'paper',
yref = 'paper'
)
fig.add_shape(type="line",
x0=0.67, y0=-0.01, x1=0.69, y1=0.01,
line=dict(color="grey",width=1),
xref = 'paper',
yref = 'paper'
)
fig.update_layout(template='plotly_white')
fig.show()
I have a simple pandas DataFrame as shown below. I want to create a scatter plot of value on the y-axis, date on the x-axis, and color the points by category. However, coloring the points isn't working.
# Create dataframe
df = pd.DataFrame({
'date': ['2016-01-01', '2016-02-01', '2016-03-01', '2016-01-01', '2016-02-01', '2016-03-01'],
'category': ['Wholesale', 'Wholesale', 'Wholesale', 'Retail', 'Retail', 'Retail'],
'value': [50, 60, 65, 55, 62, 70]
})
df['date'] = pd.to_datetime(df['date'])
# Try to plot
df.plot.scatter(x='date', y='value', c='category')
ValueError: 'c' argument must be a mpl color, a sequence of mpl colors or a sequence of numbers, not ['Wholesale' 'Wholesale' 'Wholesale' 'Retail' 'Retail' 'Retail'].
Why am I getting the error? Pandas scatter plot documentation says the argument c can be "A column name or position whose values will be used to color the marker points according to a colormap."
df.plot.scatter(x='date', y='value', c=df['category'].map({'Wholesale':'red','Retail':'blue'}))
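The error message says c must be colors or numbers, and the mapping above supplies colors. A sketch of the numeric route instead, assuming integer category codes plus a colormap are acceptable (the 'viridis' colormap is an arbitrary choice made here):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2016-01-01', '2016-02-01', '2016-03-01', '2016-01-01', '2016-02-01', '2016-03-01'],
    'category': ['Wholesale', 'Wholesale', 'Wholesale', 'Retail', 'Retail', 'Retail'],
    'value': [50, 60, 65, 55, 62, 70]
})
df['date'] = pd.to_datetime(df['date'])

# turn each category label into an integer code (categories are sorted alphabetically)
codes = df['category'].astype('category').cat.codes

# numbers are valid for c; the colormap translates them into colors
ax = df.plot.scatter(x='date', y='value', c=codes, cmap='viridis')
```

The drawback is that the colorbar shows codes rather than category names, which is why the seaborn and groupby approaches below are often preferable.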
I think you are looking for seaborn:
import seaborn as sns
sns.scatterplot(data=df, x='date', y='value', hue='category')
Output:
Or you can loop through df.groupby:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for cat, d in df.groupby('category'):
    ax.scatter(x=d['date'], y=d['value'], label=cat)
ax.legend()  # show which color belongs to which category
Output:
I have this table of data:
Date, City, Sales, PSale1, PSale2
Where PSale1, and PSale2 are predicted prices for sales.
The data set looks like this:
Date,City,Sales,PSale1,PSale2
01/01,NYC,200,300,178
01/01,SF,300,140,100
02/01,NYC,400,410,33
02/01,SF,42,24,60
I want to plot this data in seaborn. I was trying to plot a graph with the date on the x-axis and PSale1 and PSale2 as bars for each city on the y-axis, but I am confused about how to work with such data.
As far as I know, seaborn doesn't allow plotting two variables on one axis, so how should I approach this situation?
Convert the dataframe to a long format
Plot using seaborn.FacetGrid
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = {'Date': ['01/01', '01/01', '02/01', '02/01'],
'City': ['NYC', 'SF', 'NYC', 'SF'],
'Sales': [200, 300, 400, 42],
'PSale1': [300, 140, 410, 24],
'PSale2': [178, 100, 33, 60]}
df = pd.DataFrame(data)
# convert to long
dfl = df.set_index(['Date', 'City', 'Sales']).stack().reset_index().rename(columns={'level_3': 'pred', 0: 'value'})
# plot
g = sns.FacetGrid(dfl, row='City')
g.map(sns.barplot, 'Date', 'value', 'pred').add_legend()
plt.show()
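The long-format conversion can also be written with DataFrame.melt, which some find easier to read than set_index/stack. A sketch that should yield the same columns as dfl above, although the row order differs (the name dfl_melt is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['01/01', '01/01', '02/01', '02/01'],
                   'City': ['NYC', 'SF', 'NYC', 'SF'],
                   'Sales': [200, 300, 400, 42],
                   'PSale1': [300, 140, 410, 24],
                   'PSale2': [178, 100, 33, 60]})

# keep Date, City and Sales as identifiers; fold the two prediction columns
# into a 'pred' column (which prediction) and a 'value' column (its number)
dfl_melt = df.melt(id_vars=['Date', 'City', 'Sales'],
                   value_vars=['PSale1', 'PSale2'],
                   var_name='pred',
                   value_name='value')
print(dfl_melt)
```

dfl_melt can be passed to the FacetGrid code above in place of dfl.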
All in one figure
# shape the dataframe
dfc = df.drop(columns=['Sales']).set_index(['Date', 'City']).stack().unstack(level=1)
dfc.columns.name = None
# plot
dfc.plot.bar(stacked=True)
plt.xlabel('Date: Predicted')
plt.show()
I'm trying to plot sales and expenses values (on y-axis) over years (on x-axis) as given below. I'm expecting that the following code will set 2004, 2005, 2006 and 2007 as x-axis values. But, it is not showing as expected. See the image attached below. Let me know how to set the years values correctly on x-axis.
import matplotlib.pyplot as plt
years = [2004, 2005, 2006, 2007]
sales = [1000, 1170, 660, 1030]
expenses = [400, 460, 1120, 540]
plt.plot(years, sales)
plt.plot(years, expenses)
plt.show()
This would also do the work, in a different way:
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
fig = plt.figure()
ax = fig.add_subplot(111)
years = [2004, 2005, 2006, 2007]
sales = [1000, 1170, 660, 1030]
ax.plot(years, sales)
# format the x tick labels as integers
ax.xaxis.set_major_formatter(mticker.FormatStrFormatter('%d'))
plt.show()
The following piece of code should work for you
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
years = [2004, 2005, 2006, 2007]
sales = [1000, 1170, 660, 1030]
expenses = [400, 460, 1120, 540]
x = [dt.datetime.strptime(str(d),'%Y').date() for d in years]
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
plt.gca().xaxis.set_major_locator(mdates.YearLocator())
plt.plot(x, sales)
plt.plot(x, expenses)
plt.gcf().autofmt_xdate()
plt.show()
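If the years only need to appear as whole numbers, pinning the tick positions to the data is a lighter alternative to converting to dates. A sketch using Axes.set_xticks (the legend labels are added here for readability and are not in the question):

```python
import matplotlib.pyplot as plt

years = [2004, 2005, 2006, 2007]
sales = [1000, 1170, 660, 1030]
expenses = [400, 460, 1120, 540]

fig, ax = plt.subplots()
ax.plot(years, sales, label='sales')
ax.plot(years, expenses, label='expenses')

# place one tick exactly at each year so no fractional positions are drawn
ax.set_xticks(years)
ax.legend()
```

Call plt.show() to display it. This suits a handful of years; for long ranges a locator such as the YearLocator above scales better.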