How to change the space between histograms in pandas - python

I'm currently using df.hist(alpha = .5), but all of the subplots are too close from each other, like this:
Histograms
Which way is better to change the space between them?
Or is better to plot each one in a separate .png file?

One simple way is to manipulate figsize and add pyplot.tight_layout. Below is the example.
Without adjustment:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(6400)
.reshape((100, 64)), columns=['col_{}'.format(i) for i in range(64)])
df.hist(alpha=0.5)
plt.show()
You will get this as you showed:
In contrast, if you add figsize (with arbitrary size) and pyplot.tight_layout like below:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(6400)
.reshape((100, 64)), columns=['col_{}'.format(i) for i in range(64)])
df.hist(alpha=0.5, figsize=(20, 10))
plt.tight_layout()
plt.show()
In this case you will get more aligned view:
Hope this helps.

Related

A boxplot with lines connecting data points in python

I am trying to connect lines based on a specific relationship associated with the points. In this example the lines would connect the players by which court they played in. I can create the basic structure but haven't figured out a reasonably simple way to create this added feature.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df_dict={'court':[1,1,2,2,3,3,4,4],
'player':['Bob','Ian','Bob','Ian','Bob','Ian','Ian','Bob'],
'score':[6,8,12,15,8,16,11,13],
'win':['no','yes','no','yes','no','yes','no','yes']}
df=pd.DataFrame.from_dict(df_dict)
ax = sns.boxplot(x='score',y='player',data=df)
ax = sns.swarmplot(x='score',y='player',hue='win',data=df,s=10,palette=['red','green'])
plt.show()
This code generates the following plot minus the gray lines that I am after.
You can use lineplot here:
sns.lineplot(
data=df, x="score", y="player", units="court",
color=".7", estimator=None
)
The player name is converted to an integer as a flag, which is used as the value of the y-axis, and a loop process is applied to each position on the court to draw a line.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df_dict={'court':[1,1,2,2,3,3,4,4],
'player':['Bob','Ian','Bob','Ian','Bob','Ian','Ian','Bob'],
'score':[6,8,12,15,8,16,11,13],
'win':['no','yes','no','yes','no','yes','no','yes']}
df=pd.DataFrame.from_dict(df_dict)
ax = sns.boxplot(x='score',y='player',data=df)
ax = sns.swarmplot(x='score',y='player',hue='win',data=df,s=10,palette=['red','green'])
df['flg'] = df['player'].apply(lambda x: 0 if x == 'Bob' else 1)
for i in df.court.unique():
dfq = df.query('court == #i').reset_index()
ax.plot(dfq['score'], dfq['flg'], 'g-')
plt.show()

Plotting a bar chart

I have an imported excel file in python and want to create a bar chart.
In the bar chart, I want the bars to be separated by profit, 0-10, 10-20, 20-30...
How do I do this?
this is one of the things I have tried:
import NumPy as np
import matplotlib.pyplot as plt
%matplotlib inline
df.plot(kind="bar",x="profit", y="people")
df[df.profit<=10]
plt.show()
and:
df[df.profit range (10,20)]
It is a bit difficult to help you better without a sample of your data, but I constructed a dataset randomly that should have the shape of yours, so that this solution can hopefully be useful to you:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# For random data
import random
%matplotlib inline
df = pd.DataFrame({'profit':[random.choice([i for i in range(100)]) for x in range(100)], 'people':[random.choice([i for i in range(100)]) for x in range(100)]})
display(df)
out = pd.cut(df['profit'], bins=[x*10 for x in range(10)], include_lowest=True)
ax = out.value_counts(sort=False).plot.bar(rot=0, color="b", figsize=(14,4))
plt.xlabel("Profit")
plt.ylabel("People")
plt.show()
I had a look at another question on here (Pandas bar plot with binned range) and there they explained how this issue can be solved.
Hope it helps :)

setting legend only for one of the marginal plots in seaborn

I am creating a JointGrid plot using seaborn.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
mydataset=pd.DataFrame(data=np.random.rand(50,2),columns=['a','b'])
g = sns.JointGrid(x=mydataset['a'], y=mydataset['b'])
g=g.plot_marginals(sns.distplot,color='black',kde=True,hist=False,rug=True,bins=20,label='X')
g=g.plot_joint(plt.scatter,label='X')
legend_properties = {'weight':'bold','size':8}
legendMain=g.ax_joint.legend(prop=legend_properties,loc='upper right')
legendSide=g.ax_marg_x.legend(prop=legend_properties,loc='upper right')
I get this:
I would like to get rid of the legend within the vertical marginal plot (the one on the right side) but keep the one for the horizontal margin.
how to achieve that?
EDIT: The solution from #ImportanceOfBeingErnest works fine for one plot. However, if I repeat it in a for loops something unexpected happens.
I still get a legend in the upper plot and that is unexpected.
How to get rid of it?
The following code:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
mydataset=pd.DataFrame(data=np.random.rand(50,2),columns=['a','b'])
g = sns.JointGrid(x=mydataset['a'], y=mydataset['b'])
LABEL_LIST=['x','Y','Z']
for n in range(0,3):
g=g.plot_marginals(sns.distplot,color='black',kde=True,hist=False,rug=True,bins=20,label=LABEL_LIST[n])
g=g.plot_joint(plt.scatter,label=LABEL_LIST[n])
legend_properties = {'weight':'bold','size':8}
legendMain=g.ax_joint.legend(prop=legend_properties,loc='upper right')
legendSide=g.ax_marg_y.legend(labels=[LABEL_LIST[n]],prop=legend_properties,loc='upper right')
gives:
which is almost perfect, byt I need to get rid of the last legend entry in the plo on the right.
You may decide not to give any label to the marginals, but instead add the label when creating the legend inside the top marginal axes.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
mydataset=pd.DataFrame(data=np.random.rand(50,2),columns=['a','b'])
g = sns.JointGrid(x=mydataset['a'], y=mydataset['b'])
g=g.plot_marginals(sns.distplot,color='black',
kde=True,hist=False,rug=True,bins=20)
g=g.plot_joint(plt.scatter,label='X')
legend_properties = {'weight':'bold','size':8}
legendMain=g.ax_joint.legend(prop=legend_properties,loc='upper right')
legendSide=g.ax_marg_x.legend(labels=["x"],
prop=legend_properties,loc='upper right')
plt.show()
The solution is the same for a plot in a loop.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
mydataset=pd.DataFrame(data=np.random.rand(50,2),columns=['a','b'])
g = sns.JointGrid(x=mydataset['a'], y=mydataset['b'])
LABEL_LIST=['x','Y','Z']
for n in range(0,3):
g=g.plot_marginals(sns.distplot,color='black',kde=True,hist=False,rug=True,bins=20)
g=g.plot_joint(plt.scatter,label=LABEL_LIST[n])
legend_properties = {'weight':'bold','size':8}
legendMain=g.ax_joint.legend(prop=legend_properties,loc='upper right')
legendSide=g.ax_marg_x.legend(labels=LABEL_LIST,prop=legend_properties,loc='upper right')
plt.show()

Changing the size of the heatmap specifically in a seaborn clustermap?

I'm making a clustered heatmap in seaborn as follows
import numpy as np
import seaborn as sns
np.random.seed(2)
data = np.random.randn(100, 10)
sns.clustermap(data)
but the rows are squished:
but if I pass a size to the clustermap function then it looks terrible
is there a way to only increase the size of the heatmap part? So that the row names can be read, but not stretch out the cluster portions.
As #mwaskom commented, I was able to use ax_heatmap.set_position along with the get_position function to achieve the correct result.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(2)
data = np.random.randn(100, 10)
cm = sns.clustermap(data)
hm = cm.ax_heatmap.get_position()
plt.setp(cm.ax_heatmap.yaxis.get_majorticklabels(), fontsize=6)
cm.ax_heatmap.set_position([hm.x0, hm.y0, hm.width*0.25, hm.height])
col = cm.ax_col_dendrogram.get_position()
cm.ax_col_dendrogram.set_position([col.x0, col.y0, col.width*0.25, col.height*0.5])
This can be done by passing the value of the dendrogram ratio in the kw arguments
import numpy as np
import seaborn as sns
np.random.seed(2)
data = np.random.randn(100, 10)
sns.clustermap(data,figsize=(12,30),dendrogram_ratio=0.02,cmap='RdBu')

plotting multiple histograms in grid

I am running following code to draw histograms in 3 by 3 grid for 9 varaibles.However, it plots only one variable.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def draw_histograms(df, variables, n_rows, n_cols):
fig=plt.figure()
for i, var_name in enumerate(variables):
ax=fig.add_subplot(n_rows,n_cols,i+1)
df[var_name].hist(bins=10,ax=ax)
plt.title(var_name+"Distribution")
plt.show()
You're adding subplots correctly but you call plt.show for each added subplot which causes what has been drawn so far to be shown, i.e. one plot. If you're for instance plotting inline in IPython you will only see the last plot drawn.
Matplotlib provides some nice examples of how to use subplots.
Your problem is fixed like:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def draw_histograms(df, variables, n_rows, n_cols):
fig=plt.figure()
for i, var_name in enumerate(variables):
ax=fig.add_subplot(n_rows,n_cols,i+1)
df[var_name].hist(bins=10,ax=ax)
ax.set_title(var_name+" Distribution")
fig.tight_layout() # Improves appearance a bit.
plt.show()
test = pd.DataFrame(np.random.randn(30, 9), columns=map(str, range(9)))
draw_histograms(test, test.columns, 3, 3)
Which gives a plot like:
In case you don't really worry about titles, here's a one-liner
df = pd.DataFrame(np.random.randint(10, size=(100, 9)))
df.hist(color='k', alpha=0.5, bins=10)

Categories

Resources