I am trying to write a for loop that builds distplot subplots.
I have a dataframe with many columns whose lengths differ once the NaN values are excluded.
fig = make_subplots(
    rows=len(assets), cols=1,
    y_title = 'Hourly Price Distribution')
i=1
for col in df_all.columns:
    fig = ff.create_distplot([[df_all[[col]].dropna()]], col)
    fig.append()
    i+=1
fig.show()
Running the loop raises the following error:
PlotlyError: Oops! Your data lists or ndarrays should be the same length.
UPDATE:
Here is a minimal example:
df = pd.DataFrame({'2012': np.random.randn(20),
                   '2013': np.random.randn(20)+1})
df.loc[0, '2012'] = np.nan
fig = ff.create_distplot([df[c].dropna() for c in df.columns],
                         df.columns,show_hist=False,show_rug=False)
fig.show()
I would like to plot each distribution in a different subplot.
Thank you.
Update: Distribution plots
Calculating the correct values is probably both quicker and more elegant using numpy. But I often build parts of my graphs using one plotly approach (figure factory, plotly express) and then combine them with other elements of the plotly library (plotly.graph_objects) to get what I want. The complete snippet below shows how to do that: it builds a go-based subplot figure from the KDE curves that ff.create_distplot computes. I'd be happy to explain further if this suggestion suits your needs.
Plot
Complete code
import numpy as np
import pandas as pd
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objects as go
# sample data with one missing value in 2012
df = pd.DataFrame({'2012': np.random.randn(20),
                   '2013': np.random.randn(20)+1})
df.loc[0, '2012'] = np.nan
# reshape to long format so that dropna() removes only the missing observation
df = df.reset_index()
dfm = pd.melt(df, id_vars=['index'], value_vars=df.columns[1:])
dfm = dfm.dropna()
dfm.rename(columns={'variable':'year'}, inplace = True)
cols = dfm.year.unique()
nrows = len(cols)
fig = make_subplots(rows=nrows, cols=1)
# one subplot row per year
for r, col in enumerate(cols, 1):
    dfs = dfm[dfm['year']==col]
    fx1 = ff.create_distplot([dfs['value'].values], ['distplot'], curve_type='kde')
    # with the default settings, fx1.data holds the histogram, the KDE curve and the rug, in that order
    fig.add_trace(go.Scatter(
        x=fx1.data[1]['x'],
        y=fx1.data[1]['y'],
    ), row = r, col = 1)
fig.show()
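The remark about numpy above can be made concrete. This is only a sketch of a plain Gaussian kernel density estimate with a Scott's-rule bandwidth; it is a swapped-in technique, not necessarily the exact calculation ff.create_distplot performs, so the curves may differ slightly. The helper name gaussian_kde_curve and the num_points parameter are made up for this illustration.

```python
import numpy as np
import pandas as pd

def gaussian_kde_curve(values, num_points=200):
    """Evaluate a Gaussian KDE on an evenly spaced grid (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]              # ignore missing observations
    n = values.size
    bandwidth = values.std(ddof=1) * n ** (-1 / 5)  # Scott's rule for 1-D data
    grid = np.linspace(values.min(), values.max(), num_points)
    # one Gaussian bump per observation, summed on the grid
    z = (grid[:, None] - values[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))
    return grid, density

rng = np.random.default_rng(0)
df = pd.DataFrame({'2012': rng.normal(size=20), '2013': rng.normal(size=20) + 1})
df.loc[0, '2012'] = np.nan

# one (x, y) curve per column, each computed on that column's own non-missing values
curves = {col: gaussian_kde_curve(df[col]) for col in df.columns}
```

Each (x, y) pair in curves could then be passed to go.Scatter in place of fx1.data[1]['x'] and fx1.data[1]['y'].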
First suggestion
You should:
1. Restructure your data with pd.melt(df, id_vars=['index'], value_vars=df.columns[1:]),
2. and then use the resulting column 'variable' to build a subplot for each year through the facet_row argument.
In the complete snippet below you'll see that I've renamed 'variable' to 'year' to make the plot more intuitive. This approach has one particularly convenient side effect: running dfm.dropna() removes the NaN value for 2012 only. If you did the same thing on your original dataframe, the value in the same row for 2013 would be removed as well.
import numpy as np
import pandas as pd
import plotly.express as px
df = pd.DataFrame({'2012': np.random.randn(20),
                   '2013': np.random.randn(20)+1})
df.loc[0, '2012'] = np.nan
df = df.reset_index()
dfm = pd.melt(df, id_vars=['index'], value_vars=df.columns[1:])
dfm = dfm.dropna()
dfm.rename(columns={'variable':'year'}, inplace = True)
fig = px.histogram(dfm, x="value",
                   facet_row = 'year')
fig.show()
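The dropna() side effect described above is easy to check on a tiny frame. The three-row data and the variable names below are invented for the illustration.

```python
import numpy as np
import pandas as pd

# small wide frame with one missing value in the '2012' column
wide = pd.DataFrame({'2012': [np.nan, 1.0, 2.0], '2013': [3.0, 4.0, 5.0]})

# dropna() on the wide frame removes the whole first row, including the valid 2013 value
wide_dropped = wide.dropna()

# melting first means dropna() removes only the single missing observation
long = wide.reset_index().melt(id_vars='index', var_name='year', value_name='value')
long_dropped = long.dropna()

counts_wide = wide_dropped.count().to_dict()
counts_long = long_dropped.groupby('year')['value'].count().to_dict()
```

Here counts_wide keeps two values per year, while counts_long keeps two values for 2012 and all three for 2013.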
Related
I want to create a multi layer graph with the same data frame from pandas.
One should be a boxplot and the other a scatter to see where the company is located.
Is there a way to combine both plots?
boxplot
scatterplot
import pandas as pd
import plotly.express as px
df = pd.read_csv("company_index.csv", sep=";", decimal=",")
print(df)
df_u9 = df.loc[df["company"].isin(["U9"])]
fig_1 = px.box(
    df,
    x="period",
    y="index"
)
fig_2 = px.scatter(
    df_u9,
    x="period",
    y="index"
)
fig_1.show()
fig_2.show()
company_index.csv
period;index;company
1;202,4;U1
1;226,69;U10
1;235,18;U9
1;236,49;U4
1;238,13;U2
1;244,05;U6
1;252,08;U3
1;256,68;U8
1;294,99;U5
1;299,391;U7
2;243,78;U1
2;264,26;U10
2;270,6;U2
2;272,89;U9
2;285,26;U5
2;289,29;U4
2;291,15;U6
2;291,19;U3
2;305,92;U7
2;314,65;U8
3;271,82;U1
3;278,65;U2
3;296,16;U10
3;297,21;U4
3;305,93;U6
3;308,96;U5
3;323,74;U9
3;335,93;U3
3;354,13;U8
3;381,2;U7
4;281,26;U5
4;308,5;U2
4;311,61;U1
4;334,03;U4
4;335,72;U9
4;344,32;U8
4;345,27;U6
4;355,44;U3
4;373,54;U7
4;381,68;U10
5;288,6;U1
5;305,66;U5
5;323,2;U2
5;358,46;U8
5;365,57;U3
5;366,96;U10
5;368,38;U7
5;371,23;U6
5;373,63;U4
5;422,93;U9
6;285,32;U5
6;291,65;U1
6;308,68;U2
6;372,04;U8
6;376,64;U3
6;403,55;U6
6;407,38;U4
6;420,65;U10
6;423,68;U9
6;453,09;U7
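Found this solution, and it works rather well.
I was initially unsure about .data[0]: it is the first (here, the only) trace of the figure that px.scatter returns. fig.add_trace expects a single trace rather than a whole figure, so the trace has to be pulled out of the scatter figure this way.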
import pandas as pd
import plotly.express as px
df = pd.read_csv("company_index.csv", sep=";", decimal=",")
print(df)
df_u9 = df.loc[df["company"].isin(["U9"])].copy()
df_u9["size"] = 1
fig = px.box(
    df,
    x="period",
    y="index"
)
# px.scatter returns a figure; .data[0] is its single scatter trace
fig.add_trace(px.scatter(
    df_u9,
    x="period",
    y="index",
    size="size",
    size_max=15,
    color_discrete_sequence=["rgb(203,153,201)"]
).data[0])
fig.show()
Whenever I plot a dataset as a bar plot, the x axis becomes overcrowded with labels. How can I change the datatype of the x axis in the dataframe, or how can I display only every nth label?
Here is my code:
# Import statements for the packages to be used.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
# Loading the data and having a look at the first few lines
df = pd.read_csv('tmdb-movies.csv')
df.head()
# Replace 0 with NaN (Not a Number)
df['budget'] = df['budget'].replace(0, np.nan)
df['runtime'] = df['runtime'].replace(0, np.nan)
# Drop all rows with null values (NaN)
df.dropna(axis=0, inplace=True)
# Drop all columns not required for investigation
df = df.drop(['id', 'imdb_id', 'revenue', 'cast', 'homepage', 'director', 'tagline',
'keywords', 'overview', 'genres', 'production_companies', 'vote_count', 'vote_average',
'release_date', 'budget_adj', 'revenue_adj'], axis=1)
budget_grp = df.groupby(['budget'])
budget_grp['popularity'].agg(['median', 'mean'])
# Setting mean popularity to variable budget_pop.
budget_pop = budget_grp['popularity'].mean()
# Bar plot with x as budget and y as average popularity.
budget_pop.plot(kind='bar' ,x='budget', y='popularity', figsize=(20,10), xlabel='Budget in Dollars',
                ylabel='Average Popularity', rot=0, legend=True)
I have tried enumerate but don't know where to fit it in. I have also tried writing a function to pick every nth label and converting my dataframe to integers, but each attempt raises an error.
Follow-up to answer below:
You can use custom xticks:
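import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data={'x':np.arange(1,1001,1), 'y':np.random.randint(1,1000,1000)})
ax = df.plot(kind='bar' ,x='x', y='y', figsize=(20,10))
# pandas draws the bars at integer positions 0..len(df)-1, so ticks are chosen by position
tick_positions = np.arange(0, len(df), 100)
ax.set_xticks(tick_positions)
# label each remaining tick with the x value of the bar at that position
ax.set_xticklabels(df['x'].iloc[tick_positions])
plt.show()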
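Applied to the grouped data from the question, the same idea could look like the sketch below. The helper every_nth_tick and the stand-in budget_pop series are hypothetical; the real budget_pop comes from the groupby in the question.

```python
import numpy as np
import pandas as pd

def every_nth_tick(index, step):
    """Return bar positions and labels for every `step`-th entry of a bar plot's index."""
    positions = np.arange(0, len(index), step)
    labels = [index[p] for p in positions]
    return positions, labels

# hypothetical stand-in for budget_pop: mean popularity indexed by budget
budget_pop = pd.Series(np.linspace(0.1, 5.0, 25),
                       index=np.arange(1_000_000, 26_000_000, 1_000_000))

positions, labels = every_nth_tick(budget_pop.index, step=5)
```

With ax = budget_pop.plot(kind='bar'), calling ax.set_xticks(positions) and then ax.set_xticklabels(labels) would keep only every fifth label.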
I have a data frame that contains multiple variables where each variable is logically connected to a factor level of an additional group variable. I would like to plot a histogram of each variable in such a way that it is possible to show a grid of multiple histograms 'group-wise'.
Here's an example data frame df_melt (the variables var_1,var_2,var_3,var_4 are logically connected to the factor level 'foo', the variables var_5,var_6,var_7 belong to factor level 'bar'):
import numpy as np
import pandas as pd
# simulate data and create plot-ready dataframe
np.random.seed(42)
var_values = np.random.randint(low=1,high=100,size=(100,7))
var_names = ['var_1','var_2','var_3','var_4','var_5','var_6','var_7']
group_names = ['foo','foo','foo','foo','bar','bar','bar']
df = pd.DataFrame(var_values,columns=var_names)
multi_index = pd.MultiIndex.from_arrays([df.columns,group_names],names=['variable','group'])
df.columns = multi_index
df_melt = pd.melt(df)
The output should look like this:
These stackoverflow posts might help to provide an answer, but I was not able to come up with a solution on my own:
Plotting a grouped pandas data in plotly
Plotly equivalent for pd.DataFrame.hist
The best I came up with is the following. It draws one figure per group rather than the single grid you asked for, but I hope it is a useful starting point.
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# simulate data and create plot-ready dataframe
np.random.seed(42)
var_values = np.random.randint(low=1,high=100,size=(100,7))
var_names = ['var_1','var_2','var_3','var_4','var_5','var_6','var_7']
group_names = ['foo','foo','foo','foo','bar','bar','bar']
df = pd.DataFrame(var_values,columns=var_names)
multi_index = pd.MultiIndex.from_arrays([df.columns,group_names],names=['variable','group'])
df.columns = multi_index
df_melt = pd.melt(df)
uniq_cols = set(group_names)
for col in uniq_cols:
    rows = df_melt[df_melt['group']==col]['variable'].unique()
    # print(list(rows))
    num_vars = len(rows)
    fig = make_subplots(rows=1, cols=len(rows), column_titles=list(rows))
    for i, row in enumerate(rows):
        fig.add_trace(go.Histogram(x=df_melt[(df_melt['group']==col) & (df_melt['variable']==row)]['value']),
                      row=1, col=i+1)
    fig.show()
I need to create a line chart from multiple columns of a dataframe. In pandas, you can draw a chart with multiple lines using code like this:
df.plot(x='date', y=['sessions', 'cost'], figsize=(20,10), grid=True)
How can this be done using plotly_express?
With version 4.8 of Plotly.py, the code in the original question is now supported almost unmodified:
pd.options.plotting.backend = "plotly"
df.plot(x='date', y=['sessions', 'cost'])
Previous answer, as of July 2019
For this example, you could prepare the data slightly differently.
df_melt = df.melt(id_vars='date', value_vars=['sessions', 'cost'])
If you transpose/melt your columns (sessions, cost) into additional rows, then you can specify the new column 'variable' to partition by in the color parameter.
px.line(df_melt, x='date' , y='value' , color='variable')
Example plotly_express output
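To make the reshaping step concrete, here is a small sketch of what melt produces. The three rows of data are invented; only the column names come from the question.

```python
import pandas as pd

# hypothetical wide frame with the column names from the question
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-07-01', '2019-07-02', '2019-07-03']),
    'sessions': [120, 135, 128],
    'cost': [10.5, 11.0, 9.8],
})

# each (date, column) pair becomes one row; the former column name lands in 'variable'
df_melt = df.melt(id_vars='date', value_vars=['sessions', 'cost'])
```

The result has one row per date and measure, so px.line(df_melt, x='date', y='value', color='variable') draws one line for each distinct entry in 'variable'.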
With newer versions of plotly, all you need is:
df.plot()
As long as you remember to set pandas plotting backend to plotly:
pd.options.plotting.backend = "plotly"
From here you can easily adjust your plot to your liking, for example setting the theme:
df.plot(template='plotly_dark')
Plot with dark theme:
One particularly nice feature of newer versions of plotly is that a wide-format pandas dataframe no longer has to be reshaped before plotting: df.plot() draws one line per column. Long-format data works as well, as long as you say which columns hold x, y and the color grouping. Check out the details in the snippet below.
Complete code:
# imports
import plotly.express as px
import pandas as pd
import numpy as np
# settings
pd.options.plotting.backend = "plotly"
# sample dataframe of a wide format
np.random.seed(4); cols = list('abc')
X = np.random.randn(50,len(cols))
df=pd.DataFrame(X, columns=cols)
df.iloc[0]=0; df=df.cumsum()
# plotly figure
df.plot(template = 'plotly_dark')
Answer for older versions:
I would highly suggest using iplot() instead if you'd like to use plotly in a Jupyter Notebook for example:
Plot:
Code:
import plotly
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import pandas as pd
import numpy as np
# setup
init_notebook_mode(connected=True)
np.random.seed(123)
cf.set_config_file(theme='pearl')
# Random data using cufflinks
df1 = cf.datagen.lines()
df2 = cf.datagen.lines()
df3 = cf.datagen.lines()
df = pd.merge(df1, df2, how='left',left_index = True, right_index = True)
df = pd.merge(df, df3, how='left',left_index = True, right_index = True)
fig = df1.iplot(asFigure=True, kind='scatter',xTitle='Dates',yTitle='Returns',title='Returns')
iplot(fig)
It's also worth pointing out that you can combine plotly express with graph_objs. This is a good route when the lines have different x points.
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.express as px
# data set 1
x = np.linspace(0, 9, 10)
y = x
# data set 2
df = pd.DataFrame(np.column_stack([x*0.5, y]), columns=["x", "y"])
fig = go.Figure(px.scatter(df, x="x", y="y"))
fig.add_trace(go.Scatter(x=x, y=y))
fig.show()
I wish to create a seaborn pointplot to display the full data distribution in a column, alongside the distribution of the lowest 25% of values, and the distribution of the highest 25% of values, and all side by side (on the x axis).
My attempt so far gives me the values, but they are all drawn at the same position on the x-axis instead of being spread out from left to right, and there is no obvious way to label the points with x-ticks (which I would prefer to a legend).
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
df = sns.load_dataset('tips')
df1 = df[(df.total_bill < df.total_bill.quantile(.25))]
df2 = df[(df.total_bill > df.total_bill.quantile(.75))]
sns.pointplot(y=df['total_bill'], data=df, color='red')
sns.pointplot(y=df1['total_bill'], data=df1, color='green')
sns.pointplot(y=df2['total_bill'], data=df2, color='blue')
You could .join() the new distributions to your existing df and then .plot() using wide format:
lower, upper = df.total_bill.quantile([.25, .75]).values.tolist()
df = df.join(df.loc[df.total_bill < lower, 'total_bill'], rsuffix='_lower')
df = df.join(df.loc[df.total_bill > upper, 'total_bill'], rsuffix='_upper')
sns.pointplot(data=df.loc[:, [c for c in df.columns if c.startswith('total')]])
to get:
If you wanted to add groups, you could simply use .unstack() to get to long format:
df = df.loc[:, ['total_bill', 'total_bill_upper', 'total_bill_lower']].unstack().reset_index().drop('level_1', axis=1).dropna()
df.columns = ['grp', 'val']
to get:
sns.pointplot(x='grp', y='val', hue='grp', data=df)
I would think along the lines of adding a "group" column and then plotting everything as a single DataFrame.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
df = sns.load_dataset('tips')
df = pd.concat([df, df])
df.loc[(df.total_bill < df.total_bill.quantile(.25)),'group'] = 'L'
df.loc[(df.total_bill > df.total_bill.quantile(.75)),'group'] = 'H'
df = df.reset_index(drop=True)
df.loc[len(df)//2:,'group'] = 'all'
sns.pointplot(data = df,
              y='total_bill',
              x='group',
              hue='group',
              linestyles='')
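A variation on the same "add a group column" idea is to label each subset explicitly and stack the pieces, which avoids doubling the frame first. This is only a sketch: the eight-row frame stands in for the tips data, and the names long and grp are made up.

```python
import pandas as pd

# hypothetical stand-in for the tips data: eight bills
df = pd.DataFrame({'total_bill': [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0]})
lower, upper = df['total_bill'].quantile([.25, .75])

# label the full data and the two tails, then stack them into one long frame
long = pd.concat([
    df.assign(grp='all'),
    df[df['total_bill'] < lower].assign(grp='lower'),
    df[df['total_bill'] > upper].assign(grp='upper'),
], ignore_index=True)

group_sizes = long['grp'].value_counts().to_dict()
```

Passing long to sns.pointplot with x='grp', y='total_bill' and order=['lower', 'all', 'upper'] would then place the three distributions side by side, labelled on the x-axis.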