Sankey Plot not Showing in Jupyter Notebook - python

I'm pretty sure my code is fine, but I can't generate a plot of a simple Sankey chart. Maybe something is off with the code; I'm not sure. Here's what I have now. Can anyone see a problem with this?
import pandas as pd
import holoviews as hv
import plotly.graph_objects as go
import plotly.express as pex
hv.extension('bokeh')
data = [['TMD','TMD Create','Sub-Section 1',17],['TMD','TMD Create','Sub-Section 1',17],['C4C','Customer Tab','Sub-Section 1',10],['C4C','Customer Tab','Sub-Section 1',10],['C4C','Customer Tab','Sub-Section 1',17]]
df = pd.DataFrame(data, columns=['Source','Target','Attribute','Value'])
df
source = df["Source"].values.tolist()
target = df["Target"].values.tolist()
value = df["Value"].values.tolist()
labels = df["Attribute"].values.tolist()
import plotly.graph_objs as go
#create links
link = dict(source=source, target=target, value=value,
color=["turquoise","tomato"] * len(source))
#create nodes
node = dict(label=labels, pad=15, thickness=5)
#create a sankey object
chart = go.Sankey(link=link, node=node, arrangement="snap")
#build a figure
fig = go.Figure(chart)
fig.show()
I am trying to follow the basic example shown in the link below.
https://python.plainenglish.io/create-a-sankey-diagram-in-python-e09e23cb1a75

You are mentioning two different packages, and each needs a different solution. I don't know which you prefer, so I'll explain both.
Data
import pandas as pd
df = pd.DataFrame({
'Source':['a','a','b','b'],
'Target':['c','d','c','d'],
'Value': [1,2,3,4]
})
>>> df
Source Target Value
0 a c 1
1 a d 2
2 b c 3
3 b d 4
This is a very basic DataFrame with only 4 transitions.
Holoviews/Bokeh
With holoviews it is very easy to plot a Sankey diagram, because it takes the DataFrame as it is and derives the labels from the letters in the Source and Target columns.
import holoviews as hv
hv.extension('bokeh')
sankey = hv.Sankey(df)
sankey.opts(width=600, height=400)
This is created with holoviews 1.15.4 and bokeh 2.4.3.
Plotly
For plotly it is not so easy, because plotly wants integer node indices instead of letters in the Source and Target columns. Therefore we have to manipulate the DataFrame before we can create the figure.
Here I collect all the different labels and replace each one with a unique number.
# sort the labels so the numbering is deterministic (set order is arbitrary)
unique_labels = sorted(set(list(df['Source'].unique()) + list(df['Target'].unique())))
mapper = {v: i for i, v in enumerate(unique_labels)}
df['Source'] = df['Source'].map(mapper)
df['Target'] = df['Target'].map(mapper)
>>> df
Source Target Value
0 0 2 1
1 0 3 2
2 1 2 3
3 1 3 4
Afterwards I can create the dicts which plotly takes. I have to set the labels by hand, in the same order as the numbers assigned above, and the lengths of the arrays have to match.
import plotly.graph_objects as go
source = df["Source"].values.tolist()
target = df["Target"].values.tolist()
value = df["Value"].values.tolist()
#create links
link = dict(source=source, target=target, value=value, color=["turquoise","tomato"] * 2)
#create nodes
node = dict(label=['a', 'b', 'c', 'd'], pad=15, thickness=5)
#create a sankey object
chart = go.Sankey(link=link, node=node, arrangement="snap")
#build a figure
fig = go.Figure(chart)
fig.show()
I used plotly 5.13.0.
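The label-to-number step can also be done without writing the label list by hand. What follows is only a minimal sketch, assuming plain Python lists of source and target names; `encode_nodes` is a hypothetical helper name and not part of plotly.

```python
def encode_nodes(sources, targets):
    """Map node names to integer indices in order of first appearance."""
    # dict.fromkeys keeps insertion order and drops duplicates
    labels = list(dict.fromkeys(list(sources) + list(targets)))
    index = {name: i for i, name in enumerate(labels)}
    source_idx = [index[s] for s in sources]
    target_idx = [index[t] for t in targets]
    return labels, source_idx, target_idx


labels, source_idx, target_idx = encode_nodes(
    ["a", "a", "b", "b"], ["c", "d", "c", "d"]
)
print(labels)      # ['a', 'b', 'c', 'd']
print(source_idx)  # [0, 0, 1, 1]
print(target_idx)  # [2, 3, 2, 3]
```

Using `labels` for the node labels and the two index lists for the link source and target keeps every index pointing at the right label, because both come from the same ordered list.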

Related

Plotly: How to display a regression line for one variable against multiple other time series?

With a dataset such as time series for various stocks, how can you easily display a regression line for one variable against all others and quickly define a few aesthetic elements such as:
which variable to plot against the others,
theme color for the figure,
colorscale for the traces
type of trendline; linear or non-linear?
Data:
date GOOG AAPL AMZN FB NFLX MSFT
100 2019-12-02 1.216280 1.546914 1.425061 1.075997 1.463641 1.720717
101 2019-12-09 1.222821 1.572286 1.432660 1.038855 1.421496 1.752239
102 2019-12-16 1.224418 1.596800 1.453455 1.104094 1.604362 1.784896
103 2019-12-23 1.226504 1.656000 1.521226 1.113728 1.567170 1.802472
104 2019-12-30 1.213014 1.678000 1.503360 1.098475 1.540883 1.788185
Reproducible through:
import pandas as pd
import plotly.express as px
df = px.data.stocks()
The essence:
target = 'GOOG'
fig = px.scatter(df, x = target,
y = [c for c in df.columns if c != target],
color_discrete_sequence = px.colors.qualitative.T10,
template = 'plotly_dark', trendline = 'ols',
title = 'Google vs. the world')
The details:
With the latest versions of plotly.express (px) and px.scatter, these things are easy, straightforward and flexible at the same time. The snippet below will do exactly as requested in the question.
First, define target = 'GOOG' from the dataframe columns. Then, using px.scatter(), you can:
Plot the rest of the columns against the target using y = [c for c in df.columns if c != target]
Select a theme through template='plotly_dark' or find another using pio.templates.
Select a color scheme for the traces through color_discrete_sequence = px.colors.qualitative.T10 or find another using dir(px.colors.qualitative)
Define trend estimation method through trendline = 'ols' or trendline = 'lowess'
(The following plot is made with a data source of a wide format. With some very slight amendments, px.scatter() will handle data of a long format just as easily.)
Plot
Complete code:
# imports
import pandas as pd
import plotly.express as px
import plotly.io as pio
# data
df = px.data.stocks()
df = df.drop(['date'], axis = 1)
# your choices
target = 'GOOG'
colors = px.colors.qualitative.T10
# plotly
fig = px.scatter(df,
x = target,
y = [c for c in df.columns if c != target],
template = 'plotly_dark',
color_discrete_sequence = colors,
trendline = 'ols',
title = 'Google vs. the world')
fig.show()
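For intuition about what trendline = 'ols' draws for each trace, here is a minimal sketch of a one-variable least-squares fit in plain Python. It is an illustration only: plotly relies on statsmodels for the actual fit, and `ols_fit` is a hypothetical helper name.

```python
def ols_fit(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # covariance-like and variance-like sums around the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```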

Plot for sankey diagram is empty

Empty plot
I'm new to python/spyder, and I'm trying to make a sankey diagram, but no data is included in the plot, it's just empty. I found out that I need to convert from dataframe to lists, but this hasn't helped, and the plot is still empty. I kept it very simple, and pretty much copied it straight from a guide from plotly, and just imported my own data.
Here is my code, where I just removed the filepath.
Can anyone tell me what my mistake is?
edit to add image of xlsx file : xlsx file
edit 2 image of new Plot
import plotly.graph_objects as go
import pandas as pd
# go.renderers.default = "browser"
df = pd.read_excel (r'C:\filepath\data.xlsx')
labels=pd.DataFrame(df, columns= ['Label'])
label_list=labels.values.tolist()
sources=pd.DataFrame(df, columns= ['Source'])
source_list=sources.values.tolist()
targets=pd.DataFrame(df, columns= ['Target'])
target_list=targets.values.tolist()
values=pd.DataFrame(df, columns= ['Value'])
value_list=values.values.tolist()
fig = go.Figure(data=[go.Sankey(
node = dict(
pad = 15,
thickness = 20,
line = dict(color = "black", width = 0.5),
label = label_list,
color = "blue"
),
link = dict(
source = source_list,
target = target_list,
value = value_list
))])
fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
Please check the snippet below, which uses random data.
When you read from Excel, pandas already gives you a dataframe. You don't need to wrap it in a dataframe again.
You don't need to convert it to a list either. You can pass the dataframe columns directly to label, source, target and value.
import plotly.graph_objects as go
import pandas as pd
df = pd.read_excel (r'data.xlsx')
print(df)
"""
Label Source Target Value
0 A1 0 2 8
1 A2 1 3 4
2 B1 0 3 2
3 B2 2 4 8
4 C1 3 4 4
5 C2 3 5 2
"""
fig = go.Figure(data=[go.Sankey(
node = dict(
pad = 15,
thickness = 20,
line = dict(color = "black", width = 0.5),
label = df['Label'],
color = "blue"
),
link = dict(
source = df['Source'],
target = df['Target'],
value = df['Value']
))])
fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
Tips
Your label_list, source_list, target_list and value_list are not flat lists. Each one is a nested list (a list of one-element lists), because .values.tolist() on a one-column dataframe returns one inner list per row.
If you want to store your dataframe columns in lists, you can do it like this:
label_list=df['Label'].tolist()
source_list=df['Source'].tolist()
target_list=df['Target'].tolist()
value_list=df['Value'].tolist()
For more details, you can refer to:
Sankey Diagram Plotly
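The difference between the two shapes can be seen without pandas. This is a small sketch with made-up labels: the nested list mimics what .values.tolist() returns for a one-column dataframe, and the flat list is the shape that label, source, target and value expect.

```python
# Shape produced by .values.tolist() on a one-column dataframe:
# one inner list per row
nested = [["A1"], ["A2"], ["B1"]]

# Flat shape, equivalent to df['Label'].tolist()
flat = [row[0] for row in nested]

print(nested)  # [['A1'], ['A2'], ['B1']]
print(flat)    # ['A1', 'A2', 'B1']
```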

Plotly: How to customize markers depending on differences in a series?

I have this kind of line+marker plot, which I am creating with plotly go.Scatter. What I want is to subtract each pair of consecutive numbers in the list and, if the difference is 5 or more, color the marker black,
as shown in image
y=[7,9,14,16,17,10,10]
In this case 14 - 9 gives a difference of 5, and abs(10 - 17) gives 7.
def setcolor(x):
if x[1]-x[0]>=5
return 'black'
else:
return 'orange'
fig = go.Scatter(y=df['data'],
mode='markers+lines', name='data',
marker = dict(color=list(map(SetColor, df['data']))),
line=dict(color='rgb(200,200,200)'
))
but it's not working. I used this approach.
I would set up your absolute differences as their own series in your dataframe and then use an extra trace with mode='markers' to illustrate the points that satisfy your criteria. The main benefit compared to using annotations for the same markers is that you can now use plotly's interactivity to easily hide or show the highlighted markers. The code snippet below will produce this plot:
Complete code:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
# data
df = pd.DataFrame({'No':[5,7,12,11,5,10,9,15,16,13],
'Name':['nab', 'cab', 'mun', 'city',
'coun', 'nwa', 'kra', 'ihr', 'nor', 'del']})
#df.index = list('abcdefg')
diffs=df['No'].to_list()
# find differences in list
diff=[abs(j-i) for i, j in zip(diffs[:-1], diffs[1:])]
D=[np.nan]
D=D+diff
# keep differences of at least 5 and blank out the rest
D = [x if x >= 5 else np.nan for x in D]
df['diff']=D
df['ix']=df.index
df2=df.dropna()
# plotly setup
fig=go.Figure(go.Scatter(y=df['No'], x=df['Name'],
mode='lines+markers'))
fig.add_traces(go.Scatter(y=df2['No'], x=df2['Name'],
mode='markers', marker=dict(color='black', size=14)))
fig.show()
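If a single trace with per-point colors is preferred, which is closer to what the question attempted, the color list can be built up front. This is a minimal sketch using the y values from the question; `jump_colors` is a hypothetical helper name, and the threshold of 5 follows the question.

```python
def jump_colors(values, threshold=5, hit="black", miss="orange"):
    """Color each point by its absolute change from the previous point."""
    colors = [miss]  # the first point has no predecessor
    for prev, cur in zip(values[:-1], values[1:]):
        colors.append(hit if abs(cur - prev) >= threshold else miss)
    return colors


y = [7, 9, 14, 16, 17, 10, 10]
colors = jump_colors(y)
print(colors)
# ['orange', 'orange', 'black', 'orange', 'orange', 'black', 'orange']
```

The resulting list has one entry per point, so it can be passed as marker=dict(color=colors) to go.Scatter. The trade-off is that the highlighted points can no longer be toggled separately in the legend.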

How to make a line plot from a pandas dataframe with a long or wide format

(This is a self-answered post to help others shorten their answers to plotly questions by not having to explain how plotly best handles data of long and wide format)
I'd like to build a plotly figure based on a pandas dataframe in as few lines as possible. I know you can do that using plotly.express, but this fails for what I would call a standard pandas dataframe; an index describing row order, and column names describing the names of a value in a dataframe:
Sample dataframe:
a b c
0 100.000000 100.000000 100.000000
1 98.493705 99.421400 101.651437
2 96.067026 98.992487 102.917373
3 95.200286 98.313601 102.822664
4 96.691675 97.674699 102.378682
An attempt:
fig=px.line(x=df.index, y = df.columns)
This raises an error:
ValueError: All arguments should have the same length. The length of argument y is 3, whereas the length of previous arguments ['x'] is 100
Here you've tried to use a pandas dataframe of a wide format as a source for px.line.
And plotly.express is designed to be used with dataframes of a long format, often referred to as tidy data (and please take a look at that; no one explains it better than Wickham). Many, particularly those injured by years of battling with Excel, often find it easier to organize data in a wide format. So what's the difference?
Wide format:
data is presented with each different data variable in a separate column
each column has only one data type
missing values are often represented by np.nan
works best with plotly.graph_objects (go)
lines are often added to a figure using fig.add_traces()
colors are normally assigned to each trace
Example:
a b c
0 -1.085631 0.997345 0.282978
1 -2.591925 0.418745 1.934415
2 -5.018605 -0.010167 3.200351
3 -5.885345 -0.689054 3.105642
4 -4.393955 -1.327956 2.661660
5 -4.828307 0.877975 4.848446
6 -3.824253 1.264161 5.585815
7 -2.333521 0.328327 6.761644
8 -3.587401 -0.309424 7.668749
9 -5.016082 -0.449493 6.806994
Long format:
data is presented with one column containing all the values and another column listing the context of the value
missing values are simply not included in the dataset.
works best with plotly.express (px)
colors are set by a default color cycle and are assigned to each unique variable
Example:
id variable value
0 0 a -1.085631
1 1 a -2.591925
2 2 a -5.018605
3 3 a -5.885345
4 4 a -4.393955
... ... ... ...
295 95 c -4.259035
296 96 c -5.333802
297 97 c -6.211415
298 98 c -4.335615
299 99 c -3.515854
How to go from wide to long?
df = pd.melt(df, id_vars='id', value_vars=df.columns[:-1])
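To make the reshaping concrete, here is a plain-Python sketch of what that melt step does conceptually. It is not how pandas implements it; `melt_wide` is a hypothetical helper, and the tiny input dict is made up for illustration.

```python
def melt_wide(columns, id_name="id"):
    """Turn {column: [values]} with an id column into long-format rows."""
    ids = columns[id_name]
    rows = []
    for name, values in columns.items():
        if name == id_name:
            continue  # the id column is repeated per row, not melted
        for i, v in zip(ids, values):
            rows.append({id_name: i, "variable": name, "value": v})
    return rows


wide = {"a": [1.0, 2.0], "b": [3.0, 4.0], "id": [0, 1]}
long_rows = melt_wide(wide)
for row in long_rows:
    print(row)
# {'id': 0, 'variable': 'a', 'value': 1.0}
# {'id': 1, 'variable': 'a', 'value': 2.0}
# {'id': 0, 'variable': 'b', 'value': 3.0}
# {'id': 1, 'variable': 'b', 'value': 4.0}
```

Each wide column becomes a block of rows tagged with its column name, which is exactly what color='variable' groups on.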
The two snippets below will produce the very same plot:
How to use px to plot long data?
fig = px.line(df, x='id', y='value', color='variable')
How to use go to plot wide data?
colors = px.colors.qualitative.Plotly
fig = go.Figure()
fig.add_traces(go.Scatter(x=df['id'], y = df['a'], mode = 'lines', line=dict(color=colors[0])))
fig.add_traces(go.Scatter(x=df['id'], y = df['b'], mode = 'lines', line=dict(color=colors[1])))
fig.add_traces(go.Scatter(x=df['id'], y = df['c'], mode = 'lines', line=dict(color=colors[2])))
fig.show()
By the looks of it, go is more complicated and offers perhaps more flexibility? Well, yes. And no. You can easily build a figure using px and add any go object you'd like!
Complete go snippet:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
# dataframe of a wide format
np.random.seed(123)
X = np.random.randn(100,3)
df=pd.DataFrame(X, columns=['a','b','c'])
df=df.cumsum()
df['id']=df.index
# plotly.graph_objects
colors = px.colors.qualitative.Plotly
fig = go.Figure()
fig.add_traces(go.Scatter(x=df['id'], y = df['a'], mode = 'lines', line=dict(color=colors[0])))
fig.add_traces(go.Scatter(x=df['id'], y = df['b'], mode = 'lines', line=dict(color=colors[1])))
fig.add_traces(go.Scatter(x=df['id'], y = df['c'], mode = 'lines', line=dict(color=colors[2])))
fig.show()
Complete px snippet:
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.offline import iplot
# dataframe of a wide format
np.random.seed(123)
X = np.random.randn(100,3)
df=pd.DataFrame(X, columns=['a','b','c'])
df=df.cumsum()
df['id']=df.index
# dataframe of a long format
df = pd.melt(df, id_vars='id', value_vars=df.columns[:-1])
# plotly express
fig = px.line(df, x='id', y='value', color='variable')
fig.show()
I'm going to add this as an answer so it will be on record.
First of all, thank you @vestland for this. It's a question that comes up over and over, so it's good to have it addressed, and it should make it easier to flag duplicate questions.
Plotly Express now accepts wide-form and mixed-form data,
as you can check in this post.
You can change the pandas plotting backend to use plotly:
import pandas as pd
pd.options.plotting.backend = "plotly"
Then, to get a fig all you need to write is:
fig = df.plot()
fig.show() displays the above image.

Create a stacked graph or bar graph using plotly in python

I have data like this :
[ ('2018-04-09', '10:18:11',['s1',10],['s2',15],['s3',5]),
('2018-04-09', '10:20:11',['s4',8],['s2',20],['s1',10]),
('2018-04-10', '10:30:11',['s4',10],['s5',6],['s6',3]) ]
I want to plot a stacked graph preferably of this data.
X-axis will be time,
it should be like this
I created this image in paint just to show.
X axis will show time like normal graph does( 10:00 ,April 3,2018).
I am stuck because the string values (like 's1' or 's2') change from one bar to the next.
Just to hard-code and verify, I tried this:
import plotly
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import matplotlib
plotly.offline.init_notebook_mode()
def createPage():
graph_data = []
l1=[('com.p1',1),('com.p2',2)('com.p3',3)]
l2=[('com.p1',1),('com.p4',2)('com.p5',3)]
l3=[('com.p2',8),('com.p3',2)('com.p6',30)]
trace_temp = go.Bar(
x='2018-04-09 10:18:11',
y=l1[0],
name = 'top',
)
graph_data.append(trace_temp)
plotly.offline.plot(graph_data, filename='basic-scatter3.html')
createPage()
The error I am getting is: 'tuple' object is not callable.
So can someone please suggest some code for how I can plot such data?
If needed,I may store data in some other form which may be helpful in plotting.
Edit :
I used the approach suggested in the accepted answer and succeeded in plotting with plotly like this:
fig=df.iplot(kind='bar',barmode='stack',asFigure=True)
plotly.offline.plot(fig,filename="stack1.html")
However, I ran into one problem: when time intervals are very close, the data overlaps on the graph.
Is there a way to overcome this?
You could use a pandas stacked bar plot. The advantage is that pandas makes it easy to build the table of category/value pairs, which you have to generate anyhow.
from matplotlib import pyplot as plt
import pandas as pd
all_data = [('2018-04-09', '10:18:11', ['s1',10],['s2',15],['s3',5]),
('2018-04-09', '10:20:11', ['s4',8], ['s2',20],['s1',10]),
('2018-04-10', '10:30:11', ['s4',10],['s5',6], ['s6',3]) ]
#load data into dataframe
df = pd.DataFrame(all_data, columns = list("ABCDE"))
#combine the two descriptors
df["day/time"] = df["A"] + "\n" + df["B"]
#assign each list to a new row with the appropriate day/time label
df = df.melt(id_vars = ["day/time"], value_vars = ["C", "D", "E"])
#split each list into category and value
df[["category", "val"]] = pd.DataFrame(df.value.values.tolist(), index = df.index)
#create a table with category-value pairs from all lists, missing values are set to NaN
df = df.pivot(index = "day/time", columns = "category", values = "val")
#plot a stacked bar chart
df.plot(kind = "bar", stacked = True)
#give tick labels the right orientation
plt.xticks(rotation = 0)
plt.show()
Output:
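The melt and pivot steps above boil down to building one row per timestamp and one column per category. Here is a plain-Python sketch of that table, using the data from the question; `pivot_records` is a hypothetical helper, and None stands in for the NaN that pandas would use.

```python
records = [
    ("2018-04-09", "10:18:11", ["s1", 10], ["s2", 15], ["s3", 5]),
    ("2018-04-09", "10:20:11", ["s4", 8], ["s2", 20], ["s1", 10]),
    ("2018-04-10", "10:30:11", ["s4", 10], ["s5", 6], ["s6", 3]),
]


def pivot_records(records):
    """Return (categories, {timestamp: [value per category]})."""
    table = {}
    for day, time, *pairs in records:
        table[f"{day} {time}"] = {cat: val for cat, val in pairs}
    # every category seen in any record becomes a column
    categories = sorted({cat for row in table.values() for cat in row})
    # None marks a category that is absent at that timestamp
    filled = {key: [row.get(c) for c in categories] for key, row in table.items()}
    return categories, filled


categories, table = pivot_records(records)
print(categories)                    # ['s1', 's2', 's3', 's4', 's5', 's6']
print(table["2018-04-09 10:18:11"])  # [10, 15, 5, None, None, None]
print(table["2018-04-09 10:20:11"])  # [10, 20, None, 8, None, None]
```

Each row of this table is one stacked bar, and each column is one segment color, which is why the changing category names per timestamp stop being a problem once the data is pivoted.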
