python: How to graph a Sankey graph in a Dash app

I have been trying to create a Sankey graph in a Dash application using the Plotly tutorial, but I have had no success in achieving that. I have been stuck with the below error: dash.exceptions.InvalidCallbackReturnValue: The callback ..my_bee_map.figure.. is a multi-output. I can't figure out how to fix that.
My end objective is to see the flow of labels and years with values >= 0.75. What that means is that when I choose a label, e.g. cs.AI, I want to see the flow of labels and years with values >= 0.75 that are correlated to the label cs.AI.
Here is my Dash app and the values for the plot.
import pandas as pd
import plotly.express as px  # (version 4.7.0 or higher)
import plotly.graph_objects as go
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import numpy as np
from dash import Dash  # pip install dash (version 2.0.0 or higher)

app = Dash(__name__)
df = pd.read_csv("data.csv")

app.layout = html.Div([
    html.H1("Web Application Dashboards with Dash", style={'text-align': 'center'}),

    dcc.Dropdown(id="slct_label",
                 options=[{'label': x, 'value': x} for x in
                          sorted(df["Label1"].unique())],
                 multi=False,
                 value="cs.AI",
                 style={'width': "30%"}
                 ),
    html.Br(),
    dcc.Dropdown(id="slct_value",
                 options=[{'label': x, 'value': x} for x in
                          sorted(df[df["value"] >= 0.75]["value"].unique())],
                 multi=False,
                 value=0.75,
                 style={'width': "40%"},
                 placeholder="Select threshold"
                 ),
    # html.Div(id='output_container', children=[]),
    html.Br(),

    dcc.Graph(id='my_bee_map', figure={})
])


# Callback - app interactivity section ------------------------------------
@app.callback(
    [Output(component_id='output_container', component_property='children'),
     Output(component_id='my_bee_map', component_property='figure')],
    [Input(component_id='slct_label', component_property='value'),
     Input(component_id='slct_value', component_property='value')]
)
def update_graph(slct_label, slct_value):
    # print(slct_label, slct_value)
    # print(type(slct_label), type(slct_value))
    container = "The year chosen by user was: {}".format(slct_label)

    dff = df.copy()
    # if slct_label:
    dff = df[df['Label1'] != slct_label]

    label = list(dff['Label1'].unique())
    source = np.repeat(np.arange(0, 5), 18).tolist()
    Value = list(dff['value'])
    target = list(dff['Label1'])
    color = np.repeat(np.array(['#a6cee3', '#fdbf6f', '#fb9a99', '#e3a6ce', '#a6e3da'], dtype=object), 18).tolist()

    link = dict(source=source, target=target, value=Value, color=color)
    node = dict(label=label, pad=35, thickness=15)
    data = go.Sankey(link=link, node=node)

    # graph the Sankey
    fig = go.Figure(data)
    fig.update_layout(
        hovermode='x',
        title='Migration from 1990 to 1992',
        font=dict(size=10, color='white'),
        paper_bgcolor='#51504f')

    return fig, container


if __name__ == '__main__':
    app.run_server()
My data:
Label1 value year
0 cs.AI 1.00 1990
1 cs.AI 0.20 1990
2 cs.AI 0.85 1990
3 cs.AI 0.99 1990
4 cs.AI 0.19 1990
5 cs.AI 0.87 1990
6 cs.CC 0.19 1990
7 cs.CC 1.00 1990
8 cs.CC 0.34 1990
9 cs.CC 0.50 1990
10 cs.CC 0.09 1990
11 cs.CC 0.67 1990
12 cs.CE 0.94 1990
13 cs.CE 0.63 1990
14 cs.CE 1.00 1990
15 cs.CE 0.61 1990
16 cs.CE 0.82 1990
17 cs.CE 0.17 1990
18 cs.CG 0.74 1990
19 cs.CG 0.95 1990
20 cs.CG 0.53 1990
21 cs.CG 1.00 1990
22 cs.CG 0.43 1990
23 cs.CG 0.10 1990
24 cs.CL 0.31 1990
25 cs.CL 0.27 1990
26 cs.CL 0.91 1990
27 cs.CL 0.21 1990
28 cs.CL 1.00 1990
29 cs.CL 0.12 1990
30 cs.CR 0.31 1990
31 cs.CR 0.18 1990
32 cs.CR 0.76 1990
33 cs.CR 0.35 1990
34 cs.CR 0.67 1990
35 cs.CR 1.00 1990
36 cs.AI 1.00 1991
37 cs.AI 0.55 1991
38 cs.AI 0.82 1991
39 cs.AI 0.05 1991
40 cs.AI 0.17 1991
41 cs.AI 0.83 1991
42 cs.CC 0.52 1991
43 cs.CC 1.00 1991
44 cs.CC 0.64 1991
45 cs.CC 1.00 1991
46 cs.CC 0.80 1991
47 cs.CC 0.21 1991
48 cs.CE 0.10 1991
49 cs.CE 0.58 1991
50 cs.CE 1.00 1991
51 cs.CE 0.01 1991
52 cs.CE 0.77 1991
53 cs.CE 0.19 1991
54 cs.CG 0.08 1991
55 cs.CG 0.21 1991
56 cs.CG 0.63 1991
57 cs.CG 1.00 1991
58 cs.CG 0.34 1991
59 cs.CG 0.60 1991
60 cs.CL 0.91 1991
61 cs.CL 0.33 1991
62 cs.CL 0.60 1991
63 cs.CL 0.57 1991
64 cs.CL 1.00 1991
65 cs.CL 0.37 1991
66 cs.CR 1.00 1991
67 cs.CR 0.28 1991
68 cs.CR 0.92 1991
69 cs.CR 0.47 1991
70 cs.CR 0.53 1991
71 cs.CR 1.00 1991
72 cs.AI 1.00 1992
73 cs.AI 0.79 1992
74 cs.AI 0.86 1992
75 cs.AI 0.30 1992
76 cs.AI 0.27 1992
77 cs.AI 0.91 1992
78 cs.CC 0.06 1992
79 cs.CC 1.00 1992
80 cs.CC 0.72 1992
81 cs.CC 0.44 1992
82 cs.CC 0.31 1992
83 cs.CC 0.75 1992
84 cs.CE 0.40 1992
85 cs.CE 0.07 1992
86 cs.CE 1.00 1992
87 cs.CE 0.88 1992
88 cs.CE 0.79 1992
89 cs.CE 0.03 1992
90 cs.CG 0.74 1992
91 cs.CG 0.91 1992
92 cs.CG 1.00 1992
93 cs.CG 1.00 1992
94 cs.CG 0.68 1992
95 cs.CG 0.22 1992
96 cs.CL 0.42 1992
97 cs.CL 0.03 1992
98 cs.CL 0.95 1992
99 cs.CL 0.17 1992
100 cs.CL 1.00 1992
101 cs.CL 0.28 1992
102 cs.CR 0.04 1992
103 cs.CR 0.30 1992
104 cs.CR 0.26 1992
105 cs.CR 0.80 1992
106 cs.CR 0.90 1992
107 cs.CR 1.00 1992

What you put after the return statement has to be in the same order as the Outputs declared in the callback. In your code, the return should be return container, fig instead of return fig, container.
If that does not work, try a separate callback with only the Sankey figure as output (and do not use square brackets around the Output statement when you have only one output). You can make another callback for the container output. A sketch of that split is shown below.
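For reference, here is a minimal sketch of that split-callback approach. It is not the exact figure from the question: it assumes the commented-out output_container Div is re-enabled in the layout, and it builds the links by mapping labels and years to integer node indices (Plotly Sankey sources/targets must be integers, not strings), which is one way to get the label-to-year flow described above.

@app.callback(
    Output('my_bee_map', 'figure'),            # single output: no square brackets
    [Input('slct_label', 'value'),
     Input('slct_value', 'value')]
)
def update_sankey(slct_label, slct_value):
    # Keep only rows for the other labels at or above the chosen threshold.
    dff = df[(df['Label1'] != slct_label) & (df['value'] >= slct_value)]

    # Node list: labels first, then years; links run label -> year.
    labels = list(dff['Label1'].unique()) + [str(y) for y in sorted(dff['year'].unique())]
    idx = {lab: i for i, lab in enumerate(labels)}

    fig = go.Figure(go.Sankey(
        node=dict(label=labels, pad=35, thickness=15),
        link=dict(source=[idx[lab] for lab in dff['Label1']],
                  target=[idx[str(y)] for y in dff['year']],
                  value=list(dff['value'])),
    ))
    return fig                                  # one output, one return value


@app.callback(
    Output('output_container', 'children'),
    Input('slct_label', 'value')
)
def update_container(slct_label):
    return "The label chosen by user was: {}".format(slct_label)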

Related

Pandas dataframe doesn't recognize values in list

I have a list that looks something like this:
[ deptht depthb clay silt sand OM bulk_density pH sat_hidric_cond
0 89 152 NaN NaN NaN NaN NaN NaN 0.000074
1 0 25 0.20 0.72 0.08 2.00 1.30 5.8 0.000917
2 25 89 0.34 0.58 0.08 0.25 1.48 5.0 0.000091,
deptht depthb clay silt sand OM bulk_density pH sat_hidric_cond
29 0 25 0.07 0.12 0.81 3.0 1.20 6.3 0.0055
32 25 44 0.05 0.11 0.84 1.7 1.20 6.1 0.0055
41 44 70 0.04 0.08 0.88 0.6 1.58 6.4 0.0055
50 70 203 0.02 0.03 0.95 0.3 1.60 7.2 0.0055,
deptht depthb clay silt sand OM bulk_density pH \
6 157 203 0.335 0.323 0.342 0.25 1.90 7.9
8 0 25 0.225 0.527 0.248 2.00 1.40 6.2
9 25 66 0.420 0.502 0.078 0.75 1.53 6.5
12 66 109 0.240 0.518 0.242 0.25 1.53 7.5
15 109 157 0.240 0.560 0.200 0.25 1.45 7.9
sat_hidric_cond
6 0.000074
8 0.000917
9 0.000282
12 0.000776
15 0.000776 ,
deptht depthb clay silt sand OM bulk_density pH \
0 71 109 0.100 0.234 0.666 0.25 1.68 5.8
1 109 152 0.100 0.265 0.635 0.25 1.70 8.2
3 0 23 0.085 0.237 0.678 2.00 1.45 6.2
4 23 71 0.210 0.184 0.606 0.25 1.55 5.5
sat_hidric_cond
0 0.0023
1 0.0023
3 0.0028
4 0.0009 ,
deptht depthb clay silt sand OM bulk_density pH \
3 0 25 0.11 0.230 0.660 0.75 1.55 7.2
4 25 76 0.14 0.192 0.668 0.25 1.55 7.2
6 76 152 0.14 0.556 0.304 0.00 1.75 8.2
sat_hidric_cond
3 0.002800
4 0.002800
6 0.000091 ]
When I try to transform my list into a DataFrame with soil = pd.DataFrame(data)
I get this output:
0
0 deptht depthb clay silt sand OM bul...
1 deptht depthb clay silt sand OM bul...
2 deptht depthb clay silt sand OM ...
3 deptht depthb clay silt sand OM ...
4 deptht depthb clay silt sand OM b...
Those are the five elements of my list, but it is not recognizing the values associated with each variable.
However, when I use the squeeze function, soil = soil.iloc[1].squeeze(),
I get something similar to what I want as a result:
deptht depthb clay silt sand OM bulk_density pH sat_hidric_cond
29 0 25 0.07 0.12 0.81 3.0 1.20 6.3 0.0055
32 25 44 0.05 0.11 0.84 1.7 1.20 6.1 0.0055
41 44 70 0.04 0.08 0.88 0.6 1.58 6.4 0.0055
50 70 203 0.02 0.03 0.95 0.3 1.60 7.2 0.0055
But I have to use the iloc function to individually select each element of the list.
What I'm looking for is a method that I can apply to the whole list and get an output like I get when I use the pandas squeeze method.
Any help is greatly appreciated.
Thank you very much.
data is a list and it seems you want to extract the second element of the list:
soil = pd.DataFrame(data[1])
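If the goal is instead a single operation over the whole list, one option (a sketch, assuming every element of data is already a DataFrame with the same columns) is to concatenate them into one frame:

import pandas as pd

# data is the list of DataFrames from the question.
soil = pd.concat(data, keys=range(len(data)))  # outer index level = position in the list
print(soil.loc[1])                             # same rows as pd.DataFrame(data[1])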

Reading in a .txt file to get time series from rows of years and columns of monthly values

How could I read in a txt file like the one from
https://psl.noaa.gov/data/correlation/pna.data (example below)
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56
into a pandas dataframe to plot it as a time series, for example from 1960-1965, with each value column (corresponding to months) being plotted? I rarely use .txt files.
Here's what you can try:
import pandas as pd
import requests
import re
aa=requests.get("https://psl.noaa.gov/data/correlation/pna.data").text
aa=aa.split("\n")[1:-4]
aa=list(map(lambda x:x[1:],aa))
aa="\n".join(aa)
aa=re.sub(" +",",",aa)
with open("test.csv","w") as f:
    f.write(aa)
df=pd.read_csv("test.csv", header=None, index_col=0).rename_axis('Year')
df.columns=list(pd.date_range(start='2021-01', freq='M', periods=12).month_name())
print(df.head())
df.to_csv("test.csv")
This is going to give you, in the test.csv file, a table with a Year index column and one column per month (January, February, March, ... up to December), with one row of values per year from 1948 through 2021.
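A small variant of the same idea (a sketch; it skips the intermediate file and parses the cleaned text directly):

import io
import re
import pandas as pd
import requests

text = requests.get("https://psl.noaa.gov/data/correlation/pna.data").text
rows = text.split("\n")[1:-4]                       # drop the header line and trailing notes
cleaned = "\n".join(re.sub(" +", ",", r.strip()) for r in rows)

df = pd.read_csv(io.StringIO(cleaned), header=None, index_col=0).rename_axis('Year')
df.columns = pd.date_range(start='2021-01', freq='M', periods=12).month_name()
print(df.head())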
Use pd.read_fwf as suggested by @SanskarSingh:
>>> pd.read_fwf('data.txt', header=None, index_col=0).rename_axis('Year')
1 2 3 4 5 6 7 8 9 10 11 12
Year
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56
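To go from there to the time-series plot for 1960-1965, a minimal sketch (assuming data.txt holds only the yearly rows shown in the question; the month labels and matplotlib calls are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_fwf('data.txt', header=None, index_col=0).rename_axis('Year')
df.columns = pd.date_range(start='2021-01', freq='M', periods=12).month_name()  # 1-12 -> month names

# One line per month, plotted across the years.
df.loc[1960:1965].plot(figsize=(10, 5))
plt.ylabel('PNA index')
plt.show()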

How to convert the long to wide format in Pandas dataframe?

I am having dataframe df in below format
Date TLRA_CAPE TLRA_Pct B_CAPE B_Pct RC_CAPE RC_Pct
1/1/2000 10 0.20 30 0.40 50 0.60
2/1/2000 15 0.25 35 0.45 55 0.65
3/1/2000 17 0.27 37 0.47 57 0.6
I need to convert into below format
Date Variable CAPE Pct
1/1/2000 TLRA 10 0.20
2/1/2000 TLRA 15 0.25
3/1/2000 TLRA 17 0.27
1/1/2000 B 30 0.40
2/1/2000 B 35 0.45
3/1/2000 B 37 0.47
1/1/2000 RC 50 0.60
2/1/2000 RC 55 0.65
3/1/2000 RC 57 0.6
I am struggling to convert it to the required format. I tried using pd.melt and pd.pivot, but those are not working.
After changing your columns you can do this with wide_to_long. (You have both PCT and Pct; I assumed that is a typo. If not, do df.columns=df.columns.str.upper().)
df=df.set_index('Date')
df.columns=df.columns.str.split('_').map(lambda x : '_'.join(x[::-1]))
pd.wide_to_long(df.reset_index(),['CAPE','Pct'],i='Date',j='Variable',sep='_',suffix='\w+')
Out[63]:
CAPE Pct
Date Variable
1/1/2000 TLRA 10 0.20
2/1/2000 TLRA 15 0.25
3/1/2000 TLRA 17 0.27
1/1/2000 B 30 0.40
2/1/2000 B 35 0.45
3/1/2000 B 37 0.47
1/1/2000 RC 50 0.60
2/1/2000 RC 55 0.65
3/1/2000 RC 57 0.60
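An equivalent sketch using a MultiIndex and stack instead of wide_to_long (the io.StringIO block just reproduces the sample data from the question):

import io
import pandas as pd

csv = """Date,TLRA_CAPE,TLRA_Pct,B_CAPE,B_Pct,RC_CAPE,RC_Pct
1/1/2000,10,0.20,30,0.40,50,0.60
2/1/2000,15,0.25,35,0.45,55,0.65
3/1/2000,17,0.27,37,0.47,57,0.6"""
df = pd.read_csv(io.StringIO(csv)).set_index('Date')

# Split "TLRA_CAPE" into ('TLRA', 'CAPE') so the prefix becomes its own column level.
df.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')) for c in df.columns], names=['Variable', None])

out = df.stack(level='Variable').reset_index()  # columns: Date, Variable, CAPE, Pct
print(out)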

Converting string/numerical data to categorical format in pandas

I have a very large csv file that I have converted to a Pandas dataframe, which has string and integer/float values. I would like to change this data to categorical format in order to try and save some memory. I am basing this idea off of the documentation here: https://pandas.pydata.org/pandas-docs/version/0.20/categorical.html
My dataframe looks like the following:
clean_data_measurements.head(20)
station date prcp tobs
0 USC00519397 1/1/2010 0.08 65
1 USC00519397 1/2/2010 0.00 63
2 USC00519397 1/3/2010 0.00 74
3 USC00519397 1/4/2010 0.00 76
5 USC00519397 1/7/2010 0.06 70
6 USC00519397 1/8/2010 0.00 64
7 USC00519397 1/9/2010 0.00 68
8 USC00519397 1/10/2010 0.00 73
9 USC00519397 1/11/2010 0.01 64
10 USC00519397 1/12/2010 0.00 61
11 USC00519397 1/14/2010 0.00 66
12 USC00519397 1/15/2010 0.00 65
13 USC00519397 1/16/2010 0.00 68
14 USC00519397 1/17/2010 0.00 64
15 USC00519397 1/18/2010 0.00 72
16 USC00519397 1/19/2010 0.00 66
17 USC00519397 1/20/2010 0.00 66
18 USC00519397 1/21/2010 0.00 69
19 USC00519397 1/22/2010 0.00 67
20 USC00519397 1/23/2010 0.00 67
It is precipitation data which goes on for another 2,700 rows. Since it is all of the same category (station number), it should be convertible to categorical format, which will save processing time. I am just unsure of how to write the code. Can anyone help? Thanks.
I think we can convert the object columns by using factorize:
objectdf=df.select_dtypes(include='object')
df.loc[:,objectdf.columns]=objectdf.apply(lambda x : pd.factorize(x)[0])
df
Out[452]:
station date prcp tobs
0 0 0 0.08 65
1 0 1 0.00 63
2 0 2 0.00 74
3 0 3 0.00 76
5 0 4 0.06 70
6 0 5 0.00 64
7 0 6 0.00 68
8 0 7 0.00 73
9 0 8 0.01 64
10 0 9 0.00 61
11 0 10 0.00 66
12 0 11 0.00 65
13 0 12 0.00 68
14 0 13 0.00 64
15 0 14 0.00 72
16 0 15 0.00 66
17 0 16 0.00 66
18 0 17 0.00 69
19 0 18 0.00 67
20 0 19 0.00 67
You can try this as well:
import numpy as np

for y, x in zip(df.columns, df.dtypes):
    if x == 'object':
        df[y] = pd.factorize(df[y])[0]
    elif x == 'int64':
        df[y] = df[y].astype(np.int8)
    else:
        df[y] = df[y].astype(np.float32)
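If the goal is specifically the pandas category dtype from the linked documentation (rather than the integer codes that factorize produces), a minimal sketch would be:

# Convert the repetitive string columns to the category dtype and compare memory use.
before = df.memory_usage(deep=True).sum()
df['station'] = df['station'].astype('category')
df['date'] = df['date'].astype('category')
after = df.memory_usage(deep=True).sum()
print(before, after)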

Inner join on 2 columns for two dataframes python

I have 2 dataframes named geostat and geostat_query. I am trying to do an inner join on 2 columns. The code that I have written is giving me an empty result.
My dataframes are:
geostat:
STATE COUNT PERCENT state pool number STATE CODE
0 0.00 251 CA
1 0.00 252 CA
2 0.00 253 CA
3 0.00 787 CA
4 0.00 789 CA
5 0.00 4401 CA
6 0.00 4402 CA
7 0.00 4403 CA
8 0.00 4404 CA
9 0.00 4406 CA
10 0.00 4568 CA
11 0.00 4569 FL
12 0.00 4576 CA
13 0.00 4577 CA
14 0.00 4578 CA
15 0.00 4579 CA
16 0.00 4580 CA
17 0.00 4581 CA
18 0.00 4582 CA
19 0.00 4584 CA
20 0.00 4585 CA
21 0.00 4588 CA
22 0.00 4589 CA
23 0.00 4591 CA
24 0.00 4592 CA
25 0.00 4593 CA
26 0.00 4594 FL
27 0.00 4595 CA
28 0.00 4595 FL
29 0.00 6221 MS
30 0.00 817085 GA
31 0.03 817085 IL
32 0.03 817085 IN
33 0.03 817085 MA
34 0.03 817085 ME
35 0.07 817085 MI
36 0.07 817085 MO
37 0.03 817085 NE
38 0.07 817085 OH
39 0.03 817085 PA
40 0.03 817085 SC
41 0.03 817085 SD
42 0.03 817085 TX
43 0.07 817085 WI
44 0.08 817094 AL
45 0.09 817094 CA
geostat_query:
MaxOfState count percent state pool number
0 100 251
1 100 252
2 100 253
3 100 787
4 100 789
5 100 4401
6 100 4402
7 100 4403
8 100 4404
9 100 4406
10 100 4568
11 100 4569
12 100 4576
13 100 4577
14 100 4578
15 100 4579
16 100 4580
17 100 4581
18 100 4582
19 100 4584
20 100 4585
21 100 4588
22 100 4589
23 100 4591
24 100 4592
25 100 4593
26 100 4594
27 75 4595
28 100 6221
29 100 8194
The code I wrote is:
geomerge = geostat.merge(geostat_query, left_on=['STATE COUNT PERCENT','state pool number'], right_on=['MaxOfState count percent','state pool number'],how='inner')
But this gives me an empty result. I don't understand where I am going wrong.
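A common first step when an inner join comes back empty (a sketch, not a confirmed diagnosis for this data) is to check the key dtypes and the actual overlap of key pairs before merging:

# Join keys must have compatible dtypes (e.g. int vs. str pool numbers will never match).
print(geostat[['STATE COUNT PERCENT', 'state pool number']].dtypes)
print(geostat_query[['MaxOfState count percent', 'state pool number']].dtypes)

# How many (percent, pool number) pairs exist in both frames?
left_keys = set(zip(geostat['STATE COUNT PERCENT'], geostat['state pool number']))
right_keys = set(zip(geostat_query['MaxOfState count percent'], geostat_query['state pool number']))
print(len(left_keys & right_keys))  # 0 means the inner join is correctly empty for these keys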
