I have this dataframe Gas Price Brazil /
Data Frame
I get only the gasoline values from this DF and want to plot the average price (PREÇO MEDIO) over time (YEARS - ANO) from each region (REGIAO)
I used Seaborn with HUE and get this:
But when I try to plot the same thing at Plotly the result is:
How can I get the same plot with plotly?
I searched and find this: Seaborn Hue on Plotly
But this didn't work to me.
The answer:
You will achieve the same thing using plotly express and the color attribute:
fig = px.line(dfm, x="dates", y="value", color='variable')
The details:
You haven't described the structure of your data in detail, but assigning hue like this is normally meant to be applied to a data structure such as...
Date Variable Value
01.01.2020 A 100
01.01.2020 B 90
01.02.2020 A 110
01.02.2020 B 120
... where a unique hue or color is assigned to different variable names that are associated with a timestamp column where each timestamp occurs as many times as there are variables.
And that seems to be the case for seaborn too:
hue : name of variables in data or vector data, optional
Grouping variable that will produce points with different colors. Can
be either categorical or numeric, although color mapping will behave
differently in latter case.
You can achieve the same thing with plotly using the color attribute in go.Scatter(), but it seems that you could make good use of plotly.express too. Until you've provided a proper data sample, I'll show you how to do it using some sampled data in a dataframe using numpy and pandas.
Plot:
Code:
# imports
import numpy as np
import pandas as pd
import plotly.express as px
# sample time series data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(-10,12,size=(50, 4)), columns=list('ABCD'))
datelist = pd.date_range(pd.datetime(2020, 1, 1).strftime('%Y-%m-%d'), periods=50).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
df.iloc[0]=0
df=df.cumsum().reset_index()
# melt data to provide the data structure mentioned earlier
dfm=pd.melt(df, id_vars=['dates'], value_vars=df.columns[1:])
dfm.set_index('dates')
dfm.head()
# plotly
fig = px.line(dfm, x="dates", y="value", color='variable')
fig.show()
Related
I have a multi index dataframe, with the two indices being Sample and Lithology
Sample 20EC-P 20EC-8 20EC-10-1 ... 20EC-43 20EC-45 20EC-54
Lithology Pd Di-Grd Gb ... Hbl Plag Pd Di-Grd Gb
Rb 7.401575 39.055118 6.456693 ... 0.629921 56.535433 11.653543
Ba 24.610102 43.067678 10.716841 ... 1.073115 58.520532 56.946630
Th 3.176471 19.647059 3.647059 ... 0.823529 29.647059 5.294118
I am trying to put it into a seaborn lineplot as such.
spider = sns.lineplot(data = data, hue = data.columns.get_level_values("Lithology"),
style = data.columns.get_level_values("Sample"),
dashes = False, palette = "deep")
The lineplot comes out as
1
I have two issues. First, I want to format hues by lithology and style by sample. Outside of the lineplot function, I can successfully access sample and lithology using data.columns.get_level_values, but in the lineplot they don't seem to do anything and I haven't figured out another way to access these values. Also, the lineplot reorganizes the x-axis by alphabetical order. I want to force it to keep the same order as the dataframe, but I don't see any way to do this in the documentation.
To use hue= and style=, seaborn prefers it's dataframes in long form. pd.melt() will combine all columns and create new columns with the old column names, and a column for the values. The index too needs to be converted to a regular column (with .reset_index()).
Most seaborn functions use order= to set an order on the x-values, but with lineplot the only way is to make the column categorical applying a fixed order.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
column_tuples = [('20EC-P', 'Pd '), ('20EC-8', 'Di-Grd'), ('20EC-10-1 ', 'Gb'),
('20EC-43', 'Hbl Plag Pd'), ('20EC-45', 'Di-Grd'), ('20EC-54', 'Gb')]
col_index = pd.MultiIndex.from_tuples(column_tuples, names=["Sample", "Lithology"])
data = pd.DataFrame(np.random.uniform(0, 50, size=(3, len(col_index))), columns=col_index, index=['Rb', 'Ba', 'Th'])
data_long = data.melt(ignore_index=False).reset_index()
data_long['index'] = pd.Categorical(data_long['index'], data.index) # make categorical, use order of the original dataframe
ax = sns.lineplot(data=data_long, x='index', y='value',
hue="Lithology", style="Sample", dashes=False, markers=True, palette="deep")
ax.set_xlabel('')
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1.02))
plt.tight_layout() # fit legend and labels into the figure
plt.show()
The long dataframe looks like:
index Sample Lithology value
0 Rb 20EC-P Pd 6.135005
1 Ba 20EC-P Pd 6.924961
2 Th 20EC-P Pd 44.270570
...
(This is a self-answered post to help others shorten their answers to plotly questions by not having to explain how plotly best handles data of long and wide format)
I'd like to build a plotly figure based on a pandas dataframe in as few lines as possible. I know you can do that using plotly.express, but this fails for what I would call a standard pandas dataframe; an index describing row order, and column names describing the names of a value in a dataframe:
Sample dataframe:
a b c
0 100.000000 100.000000 100.000000
1 98.493705 99.421400 101.651437
2 96.067026 98.992487 102.917373
3 95.200286 98.313601 102.822664
4 96.691675 97.674699 102.378682
An attempt:
fig=px.line(x=df.index, y = df.columns)
This raises an error:
ValueError: All arguments should have the same length. The length of argument y is 3, whereas the length of previous arguments ['x'] is 100`
Here you've tried to use a pandas dataframe of a wide format as a source for px.line.
And plotly.express is designed to be used with dataframes of a long format, often referred to as tidy data (and please take a look at that. No one explains it better that Wickham). Many, particularly those injured by years of battling with Excel, often find it easier to organize data in a wide format. So what's the difference?
Wide format:
data is presented with each different data variable in a separate column
each column has only one data type
missing values are often represented by np.nan
works best with plotly.graphobjects (go)
lines are often added to a figure using fid.add_traces()
colors are normally assigned to each trace
Example:
a b c
0 -1.085631 0.997345 0.282978
1 -2.591925 0.418745 1.934415
2 -5.018605 -0.010167 3.200351
3 -5.885345 -0.689054 3.105642
4 -4.393955 -1.327956 2.661660
5 -4.828307 0.877975 4.848446
6 -3.824253 1.264161 5.585815
7 -2.333521 0.328327 6.761644
8 -3.587401 -0.309424 7.668749
9 -5.016082 -0.449493 6.806994
Long format:
data is presented with one column containing all the values and another column listing the context of the value
missing values are simply not included in the dataset.
works best with plotly.express (px)
colors are set by a default color cycle and are assigned to each unique variable
Example:
id variable value
0 0 a -1.085631
1 1 a -2.591925
2 2 a -5.018605
3 3 a -5.885345
4 4 a -4.393955
... ... ... ...
295 95 c -4.259035
296 96 c -5.333802
297 97 c -6.211415
298 98 c -4.335615
299 99 c -3.515854
How to go from wide to long?
df = pd.melt(df, id_vars='id', value_vars=df.columns[:-1])
The two snippets below will produce the very same plot:
How to use px to plot long data?
fig = px.line(df, x='id', y='value', color='variable')
How to use go to plot wide data?
colors = px.colors.qualitative.Plotly
fig = go.Figure()
fig.add_traces(go.Scatter(x=df['id'], y = df['a'], mode = 'lines', line=dict(color=colors[0])))
fig.add_traces(go.Scatter(x=df['id'], y = df['b'], mode = 'lines', line=dict(color=colors[1])))
fig.add_traces(go.Scatter(x=df['id'], y = df['c'], mode = 'lines', line=dict(color=colors[2])))
fig.show()
By the looks of it, go is more complicated and offers perhaps more flexibility? Well, yes. And no. You can easily build a figure using px and add any go object you'd like!
Complete go snippet:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
# dataframe of a wide format
np.random.seed(123)
X = np.random.randn(100,3)
df=pd.DataFrame(X, columns=['a','b','c'])
df=df.cumsum()
df['id']=df.index
# plotly.graph_objects
colors = px.colors.qualitative.Plotly
fig = go.Figure()
fig.add_traces(go.Scatter(x=df['id'], y = df['a'], mode = 'lines', line=dict(color=colors[0])))
fig.add_traces(go.Scatter(x=df['id'], y = df['b'], mode = 'lines', line=dict(color=colors[1])))
fig.add_traces(go.Scatter(x=df['id'], y = df['c'], mode = 'lines', line=dict(color=colors[2])))
fig.show()
Complete px snippet:
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.offline import iplot
# dataframe of a wide format
np.random.seed(123)
X = np.random.randn(100,3)
df=pd.DataFrame(X, columns=['a','b','c'])
df=df.cumsum()
df['id']=df.index
# dataframe of a long format
df = pd.melt(df, id_vars='id', value_vars=df.columns[:-1])
# plotly express
fig = px.line(df, x='id', y='value', color='variable')
fig.show()
I'm going to add this as answer so it will be on evidence.
First of all thank you #vestland for this. It's a question that come over and over so it's good to have this addressed and it could be easier to flag duplicated question.
Plotly Express now accepts wide-form and mixed-form data
as you can check in this post.
You can change the pandas plotting backend to use plotly:
import pandas as pd
pd.options.plotting.backend = "plotly"
Then, to get a fig all you need to write is:
fig = df.plot()
fig.show() displays the above image.
In my dataset I have a categorical column named 'Type'contain(eg.,INVOICE,IPC,IP) and 'Date' column contain dates(eg,2014-02-01).
how can I plot these two.
On x axis I want date
On y axis a line for (eg.INVOCE) showing its trend
enter image description here
Not very sure what you mean by plot and show trend, one ways is to count like #QuangHoang suggested, and plot with a heatmap, something like below. If it is something different, please expand on your question.
import pandas as pd
import numpy as np
import seaborn as sns
dates = pd.date_range(start='1/1/2018', periods=5, freq='3M')[np.random.randint(0,5,20)]
type = np.random.choice(['INVOICE','IPC','IP'],20)
df = pd.DataFrame({'dates':dates ,'type':type})
tab = pd.crosstab(df['type'],df['dates'].dt.strftime('%d-%m-%Y'))
n = np.unique(tab.values)
cmap = sns.color_palette("BuGn_r",len(n))
sns.heatmap(tab,cmap=cmap)
I am willing to plot 3 timeseries on the same chart. Datasource is a pandas.DataFrame() object, the type of Timestamp being datetime.date, and the 3 different time series drawn from the same column Value using the color argument of plotly.express.line().
The 3 lines show on the chart, but each one is accompanied by some sort of trendline. I can't see in the function signature how to disable those trendlines. Can you please help?
I have made several attempts, e.g. using another color, but the trendlines just stay there.
Please find below the code snippet and the resulting chart.
import plotly.io as pio
import plotly.express as px
pio.renderers = 'jupyterlab'
fig = px.line(data_frame=df, x='Timestamp', y='Value', color='Position_Type')
fig.show()
(If relevant, I am using jupyterlab)
Timestamp on the screen appears like this (this are [regular] weekly timeseries) :
And, as per the type:
type(df.Timestamp[0])
> datetime.date
I am adding that it looks like the lines that I first thought were trendlines would rather be straight lines from the first datapoint to the last datapoint of each time series.
df_melt = df_melt.sort_values('datetime_id')
Sorting got rid of those "wrap-arounds". Thanks for the suggestions above. Using Plotly 4.8.2.
Introduction:
Your provided data sample is an image, and not very easy to work with, so I'm going to use some sampled random time series to offer a suggestion. The variables in your datasample don't match the ones you've used in px.Scatter either by the way.
I'm on plotly version '4.2.0' and unable to reproduce your issue. Hopefully you'll find this suggestion useful anyway.
Using data structured like this...
Timestamp Position_type value
145 2020-02-15 value3 86.418593
146 2020-02-16 value3 78.285128
147 2020-02-17 value3 79.665202
148 2020-02-18 value3 84.502445
149 2020-02-19 value3 91.287312
...I'm able to produce this plot...
...using this code:
# imports
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import pandas as pd
import numpy as np
# data
np.random.seed(123)
frame_rows = 50
n_plots = 2
frame_columns = ['V_'+str(e) for e in list(range(n_plots+1))]
df = pd.DataFrame(np.random.uniform(-10,10,size=(frame_rows, len(frame_columns))),
index=pd.date_range('1/1/2020', periods=frame_rows),
columns=frame_columns)
df=df.cumsum()+100
df.iloc[0]=100
df.reset_index(inplace=True)
df.columns=['Timestamp','value1', 'value2', 'value3' ]
varNames=df.columns[1:]
# melt dataframe with timeseries from wide to long format.
# YOUR dataset seems to be organized in a long format since
# you're able to set color using a variable name
df_long = pd.melt(df, id_vars=['Timestamp'], value_vars=varNames, var_name='Position_type', value_name='value')
#df_long.tail()
# plotly time
import plotly.io as pio
import plotly.express as px
#pio.renderers = 'jupyterlab'
fig = px.scatter(data_frame=df_long, x='Timestamp', y='value', color='Position_type')
#fig = px.line(data_frame=df_long, x='Timestamp', y='value', color='Position_type')
fig.show()
If you change...
px.scatter(data_frame=df_long, x='Timestamp', y='value', color='Position_type')
...to...
fig = px.line(data_frame=df_long, x='Timestamp', y='value', color='Position_type')
...you'll get this plot instead:
No trendlines as far as the eye can see.
Edit - I think I know what's going on...
Having taken a closer look at your figure, I've realized that those lines are not trendlines. A trendline doesn't normally start at the initial value of a series and end up at the last value of the series. And that's what happening here for all three series. So I think you've got some bad or duplicate timestamps somewhere.
Problem: I have timeseries data of several days and I use sns.FacetGrid function of Seaborn python library to plot this data in facet form. In several cases, I found that mentioned seaborn function plots consecutive missing values (nan values) between two readings with a continuous line. While as matplotlib shows missing values as a gap, which makes sense. A demo example is as
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# create timeseries data for 3 days such that day two contains NaN values
time_duration1 = pd.date_range('1/1/2018', periods=24,freq='H')
data1 = np.random.randn(len(time_duration1))
ds1 = pd.Series(data=data1,index=time_duration1)
time_duration2 = pd.date_range('1/2/2018',periods=24,freq='H')
data2 = [float('nan')]*len(time_duration2)
ds2 = pd.Series(data=data2,index=time_duration2)
time_duration3 = pd.date_range('1/3/2018', periods=24,freq='H')
data3 = np.random.randn(len(time_duration3))
ds3 = pd.Series(data=data3,index=time_duration3)
# combine all three days series and then convert series into pandas dataframe
DS = pd.concat([ds1,ds2,ds3])
DF = DS.to_frame()
DF.plot()
It results into following plot
Above Matplotlib plot shows missing values with a gap.
Now let us prepare same data for seaborn function as
DF['col'] = np.ones(DF.shape[0])# dummy column but required for facets
DF['timestamp'] = DF.index
DF.columns = ['data_val','col','timestamp']
g = sns.FacetGrid(DF,col='col',col_wrap=1,size=2.5)
g.map_dataframe(plt.plot,'timestamp','data_val')
See, how seaborn plot shows missing data with a line. How should I force seaborn to not plot nan values with such a line?
Note: This is a dummy example, and I need facet grid in any case to plot my data.
FacetGrid by default removes nan from the data. The reason is that some functions inside seaborn would not work properly with nans (especially some of the statistical function, I'd say).
In order to keep the nan values in the data, use the dropna=False argument to FacetGrid:
g = sns.FacetGrid(DF,... , dropna=False)