Matplotlib stacked histogram from Pandas is being cut by a strange line

Something strange is happening in matplotlib.
I have a pandas dataframe and I'm making a stacked histogram using two of its columns. One column holds floats that go into the histogram bins. The other column contains only 0's and 1's, which are used to separate the data into two stacks. My actual code is a bit more complicated, but it goes something like this:
print(df)
df =
col1 col2
1.7 1
2.4 0
3.1 0
4.0 1
etc etc
# First I separate the data by the 0's and 1's in col2
df_1 = df.loc[df['col2']==1]
df_0 = df.loc[df['col2']==0]
Plotting with matplotlib's histogram function works fine, sort of. If I call this:
fig,axes= plt.subplots(nrows=1, ncols=1)
n,bins,patches= axes.hist( [ df_0['col1'], df_1['col1'] ] , histtype='step', stacked=True, Fill=True)
...I get this very nice plot:
HOWEVER, something very strange happens if I flip the order of df_0 and df_1 when I call hist().
Like if I do this instead:
n,bins,patches= axes.hist( [ df_1['col1'], df_0['col1'] ] , histtype='step', stacked=True, Fill=True)
... I get a plot with the stacks flipped (as expected), BUT now the plot has picked up a strange artifact; there's like an invisible line that is cutting off and filling in some places of the graph with color.
What the heck is going on here? My first thought was that maybe column1 or column2 had NaN values or something, but I checked those and the column values are fine. Any ideas on what might be causing this?

fill is not a useful argument to hist. It is a valid argument, because you may fill any patch in matplotlib. However, here you do not have a closed patch to fill.
Instead you may be looking for the different histtype options that are shown in the histogram_histtypes example.
histtype="stepfilled"
histtype='bar'
In this case they both give the same plot:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np; np.random.seed(42)
a = np.random.rayleigh(size=20)
b = np.random.randn(20)+3
df = pd.DataFrame({"col1" : np.concatenate((a,b)),
"col2" : [0]*20 + [1]*20})
df_1 = df.loc[df['col2']==1]
df_0 = df.loc[df['col2']==0]
fig,axes= plt.subplots(ncols=2)
n,bins,patches= axes[0].hist([df_0['col1'], df_1['col1']], histtype='stepfilled', stacked=True)
n,bins,patches= axes[1].hist([df_0['col1'], df_1['col1']], histtype='bar', stacked=True)
plt.show()
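As a quick sanity check on the point above, here is a minimal sketch (not part of the original answer) that draws the two stacking orders with histtype='stepfilled' and compares the returned counts. The arrays zeros and ones are made-up stand-ins for df_0['col1'] and df_1['col1'], and the Agg backend is only an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend (assumption: no display needed)
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
zeros = rng.rayleigh(size=20)       # hypothetical stand-in for df_0['col1']
ones = rng.normal(loc=3, size=20)   # hypothetical stand-in for df_1['col1']

fig, axes = plt.subplots(ncols=2)
# Same data, stacked in both orders, using a filled histtype instead of fill=True
n_a, bins_a, _ = axes[0].hist([zeros, ones], histtype='stepfilled', stacked=True)
n_b, bins_b, _ = axes[1].hist([ones, zeros], histtype='stepfilled', stacked=True)

# With stacked=True the last row of n is the running total of all datasets,
# so the outer outline should not depend on the stacking order.
same_bins = np.allclose(bins_a, bins_b)
same_outline = np.allclose(n_a[-1], n_b[-1])
total = n_a[-1].sum()
print(same_bins, same_outline, total)
```

If the two outlines agree, flipping the order only swaps which color sits at the bottom of the stack.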

Related

Multi Index Seaborn Line Plot

I have a multi index dataframe, with the two indices being Sample and Lithology
Sample 20EC-P 20EC-8 20EC-10-1 ... 20EC-43 20EC-45 20EC-54
Lithology Pd Di-Grd Gb ... Hbl Plag Pd Di-Grd Gb
Rb 7.401575 39.055118 6.456693 ... 0.629921 56.535433 11.653543
Ba 24.610102 43.067678 10.716841 ... 1.073115 58.520532 56.946630
Th 3.176471 19.647059 3.647059 ... 0.823529 29.647059 5.294118
I am trying to put it into a seaborn lineplot as such.
spider = sns.lineplot(data = data, hue = data.columns.get_level_values("Lithology"),
style = data.columns.get_level_values("Sample"),
dashes = False, palette = "deep")
The lineplot comes out like this:
I have two issues. First, I want to format hues by lithology and style by sample. Outside of the lineplot function, I can successfully access sample and lithology using data.columns.get_level_values, but in the lineplot they don't seem to do anything and I haven't figured out another way to access these values. Also, the lineplot reorganizes the x-axis by alphabetical order. I want to force it to keep the same order as the dataframe, but I don't see any way to do this in the documentation.

To use hue= and style=, seaborn prefers its dataframes in long form. pd.melt() combines all columns into new columns holding the old column names, plus a column for the values. The index also needs to be converted to a regular column (with .reset_index()).
Most seaborn functions use order= to set an order on the x-values, but with lineplot the only way is to make the column categorical with a fixed order.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
column_tuples = [('20EC-P', 'Pd'), ('20EC-8', 'Di-Grd'), ('20EC-10-1', 'Gb'),
                 ('20EC-43', 'Hbl Plag Pd'), ('20EC-45', 'Di-Grd'), ('20EC-54', 'Gb')]
col_index = pd.MultiIndex.from_tuples(column_tuples, names=["Sample", "Lithology"])
data = pd.DataFrame(np.random.uniform(0, 50, size=(3, len(col_index))), columns=col_index, index=['Rb', 'Ba', 'Th'])
data_long = data.melt(ignore_index=False).reset_index()
data_long['index'] = pd.Categorical(data_long['index'], data.index) # make categorical, use order of the original dataframe
ax = sns.lineplot(data=data_long, x='index', y='value',
hue="Lithology", style="Sample", dashes=False, markers=True, palette="deep")
ax.set_xlabel('')
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1.02))
plt.tight_layout() # fit legend and labels into the figure
plt.show()
The long dataframe looks like:
index Sample Lithology value
0 Rb 20EC-P Pd 6.135005
1 Ba 20EC-P Pd 6.924961
2 Th 20EC-P Pd 44.270570
...
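To see why the categorical conversion fixes the x-axis order, here is a small pandas-only sketch. The frame long_df and its values are made up for illustration; only the element names Rb, Ba, Th come from the question.

```python
import pandas as pd

elements = ['Rb', 'Ba', 'Th']  # row order of the original dataframe

# Hypothetical long-form data with the rows deliberately shuffled
long_df = pd.DataFrame({
    'index': ['Th', 'Rb', 'Ba', 'Rb'],
    'value': [1.0, 2.0, 3.0, 4.0],  # made-up values
})

# Plain strings sort alphabetically
plain_order = sorted(long_df['index'].unique())

# A categorical column sorts by its category order instead
long_df['index'] = pd.Categorical(long_df['index'], categories=elements, ordered=True)
categorical_order = list(long_df.sort_values('index')['index'].unique())

print(plain_order)        # alphabetical
print(categorical_order)  # order of the original dataframe
```

The categorical column carries its own ordering, which is what lets the plot keep the dataframe's row order rather than falling back to alphabetical.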

Unusual bar plot in matplotlib

I need to create a somewhat unusual bar plot in matplotlib and the standard functionality does not seem to offer what I need.
I have clustered some documents and want to show the 5 most important keywords per cluster. The first problem is that I have one group per cluster which consists of 5 individual bars. The second problem is that the labels of these individual bars are important, not the same across groups and not unique either.
I have a makeshift prototype that looks like this:
I just plotted all the individual bars in the right order and separated them by empty entries. The biggest problem (aside from being ugly) is that the only way to identify the cluster is by counting the groups. It would help a lot if the clusters could be identified either by color or something else, but I cannot figure out how to do this.
Edit: Here is some requested toy data as well as the code used to produce the plot I already have.
Toy data:
The following two pandas dataframes are included in an array. The two code blocks include the results from df_list[i].to_csv(). I hope this helps, but for the context of this problem the actual data does not really matter, so you can also just create your own dataframes.
,features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127
and
,features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198
Code:
The approach for the current solution is to combine all the individual dataframes into one dataframe, add empty entries where necessary, and plot the result.
def plot_all_clusters_words(dfs):
    # target structure: word as non unique column, value as other non unique column
    df_dict_list = []
    for df in dfs:
        for index, row in df.iterrows():
            df_dict_list.append({"word": row.features, "value": row.score})
        df_dict_list.append({"word": "", "value": 0})
    df_dict_list = df_dict_list[:-1]
    new_df = pd.DataFrame(df_dict_list)
    new_df.plot.bar(x="word")
    plt.show()
    return new_df
Note:
I just need a way to easily identify the groups. If you know a different approach than the ones I suggested above, feel free to suggest it.

Calling plt.bar for each of the dataframes, each with its own label and color, would create the following plot:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from io import StringIO
df1_str = '''features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127'''
df2_str = '''features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198'''
df1 = pd.read_csv(StringIO(df1_str))
df2 = pd.read_csv(StringIO(df2_str))
dfs = [df1, df2]
cluster_names = [f'cluster {i}' for i in range(1, len(dfs) + 1)]
colors = plt.cm.rainbow(np.linspace(0, 1, len(dfs)))
bar_width = 0.8 # width of individual bars
cluster_gap = 0.2 # extra distance between clusters
starts = np.append(0, np.array([len(df) + cluster_gap for df in dfs]).cumsum())
all_tickpos = [s + np.arange(len(df)) for df, s in zip(dfs, starts)]
for df, name, color, tickpos in zip(dfs, cluster_names, colors, all_tickpos):
    plt.bar(tickpos, df['score'], width=bar_width, color=color, label=name)
plt.xticks(np.concatenate(all_tickpos), [f for df in dfs for f in df['features']], rotation=90)
plt.legend()
plt.tight_layout()
plt.show()

Multiple data frame columns plotted in the same bar without overlapping

I have a pandas dataframe:
import pandas as pd
data1 = {'Date':['03-19-2019'],
'Total':[35],
'Solved':[19],
'Arrived':[23],
}
df1 = pd.DataFrame(data1)
and I want to plot a bar plot like this:
with
df1.plot(kind='barh',x='Date',y='Total', ax=ax0, color='#C0C0C0',
width=0.5)
df1.plot(kind='barh',x='Date',y='Arrived', ax=ax0, color='#C0FFFF',
width=0.5)
df1.plot(kind='barh',x='Date',y='Solved', ax=ax0, color='#C0C0FF',
width=0.5)
However, to avoid overlapping, I have to draw each column taking into account which of them has the bigger value (Total greater than Arrived greater than Solved).
How can I avoid doing this by hand and automate the process easily?

There must be a more straightforward approach in pandas, but I came up with this quick workaround. The idea is as follows:
Leave out the first column, Date, and sort the remaining columns by value.
Use the sorted indices to plot the columns in descending order, so the largest bar is drawn first and the smaller ones end up on top.
To keep the colors consistent, use a dictionary so that the plotting order doesn't affect which color each column gets.
import numpy as np
import matplotlib.pyplot as plt
fig, ax0 = plt.subplots()
# positions of the value columns, sorted by value with the largest first
ids = np.argsort(df1.values[0][1:])[::-1]
colors = {'Total': '#C0C0C0', 'Arrived': '#C0FFFF', 'Solved':'#C0C0FF'}
for col in np.array(df1.columns[1:].tolist())[ids]:
    df1.plot(kind='barh',x='Date',y=col, ax=ax0, color=colors[col], width=0.1)
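The ordering step of this workaround can also be written with pandas alone. This is a minimal sketch rather than the answerer's code; the name draw_order is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Date': ['03-19-2019'],
                    'Total': [35],
                    'Solved': [19],
                    'Arrived': [23]})

# Take the single row of values (Date becomes the index) and sort it,
# largest first, to get the order in which the bars should be drawn.
row = df1.set_index('Date').iloc[0]
draw_order = row.sort_values(ascending=False).index.tolist()
print(draw_order)
```

Looping over draw_order in place of the np.argsort expression should give the same plotting sequence, with the widest bar drawn first.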

A stacked bar graph can be produced in pandas via the stacked=True option. To use this you need to make the "Date" the index first.
import matplotlib.pyplot as plt
import pandas as pd
data1 = {'Date':['03-19-2019'],
'Total':[35],
'Solved':[19],
'Arrived':[23],
}
df = pd.DataFrame(data1)
df.set_index("Date").plot(kind="barh", stacked=True)
plt.show()

Adjusting the color coding on a barplot so that all values are color coded correctly in matplotlib

I have a barplot that plots Rates by State and by Category (there are 5 categories) but the problem is that some States have more categories than other states.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"state" : ["AL","AL","AL","AK", ],
"status" : ["Booked", "Rejected","Cancelled","Rejected"],
"0" : [1.5,2.5,3.5,1.0]})
df2 = df.groupby(['state','status']).size()/df.groupby(['state']).size()
fig, ax = plt.subplots()
plt.xlabel('State')
plt.ylabel('Bookings')
my_colors = 'gyr'
df2.plot(kind='bar', color=my_colors, orientation='vertical')
plt.tight_layout()
plt.show()
This does a good job with most of what I need. However, because some States do not have all values for status, those values do not appear in the plot, and the color coding becomes incorrect: the colors simply repeat every 5 colors rather than depending on whether a value is missing. What can I do about this?

Possibly you want to show the data in a grouped fashion, namely to have 3 categories per group, such that each category has its own color.
In this case, it seems this can easily be achieved by unstacking the multi-indexed result:
df2.unstack().plot(...)
Complete example:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"state" : ["AL","AL","AL","AK", ],
"status" : ["Booked", "Rejected","Cancelled","Rejected"],
"0" : [1.5,2.5,3.5,1.0]})
df2 = df.groupby(['state','status']).size()/df.groupby(['state']).size()
fig, ax = plt.subplots()
plt.xlabel('State')
plt.ylabel('Bookings')
my_colors = 'gyr'
df2.unstack().plot(kind='bar', color=my_colors, orientation='vertical', ax=ax)
plt.tight_layout()
plt.show()
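To see what unstacking does to the data before plotting, here is a small sketch using the toy data from the question. It uses Series.div with level='state' instead of the / operator purely to make the alignment explicit; that substitution, and the names counts, totals, share and wide, are choices made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"state": ["AL", "AL", "AL", "AK"],
                   "status": ["Booked", "Rejected", "Cancelled", "Rejected"]})

counts = df.groupby(['state', 'status']).size()   # one entry per (state, status)
totals = df.groupby('state').size()               # one entry per state
share = counts.div(totals, level='state')         # fraction of each status per state

# unstack() moves the inner index level (status) into columns. Combinations
# that never occur become NaN instead of being dropped, so every status
# keeps its own column and therefore its own color in a bar plot.
wide = share.unstack()
print(wide)
```

Because each status is now a fixed column, the colors are assigned per column rather than per bar, and a missing status simply leaves a gap.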

How to plot multiple linear regressions in the same figure

Given the following:
import numpy as np
import pandas as pd
import seaborn as sns
np.random.seed(365)
x1 = np.random.randn(50)
y1 = np.random.randn(50) * 100
x2 = np.random.randn(50)
y2 = np.random.randn(50) * 100
df1 = pd.DataFrame({'x1':x1, 'y1': y1})
df2 = pd.DataFrame({'x2':x2, 'y2': y2})
sns.lmplot(x='x1', y='y1', data=df1, fit_reg=True, ci=None)
sns.lmplot(x='x2', y='y2', data=df2, fit_reg=True, ci=None)
This will create 2 separate plots. How can I add the data from df2 onto the SAME graph? All the seaborn examples I have found online seem to focus on how you can create adjacent graphs (say, via the 'hue' and 'col_wrap' options). Also, I prefer not to use the dataset examples where an additional column might be present as this does not have a natural meaning in the project I am working on.
If there is a mixture of matplotlib/seaborn functions that are required to achieve this, I would be grateful if someone could help illustrate.

You could use seaborn's FacetGrid class to get the desired result.
You would need to replace your plotting calls with these lines:
import matplotlib.pyplot as plt
# sns.lmplot('x1', 'y1', df1, fit_reg=True, ci = None)
# sns.lmplot('x2', 'y2', df2, fit_reg=True, ci = None)
df = pd.concat([df1.rename(columns={'x1':'x','y1':'y'})
                .join(pd.Series(['df1']*len(df1), name='df')),
                df2.rename(columns={'x2':'x','y2':'y'})
                .join(pd.Series(['df2']*len(df2), name='df'))],
               ignore_index=True)
pal = dict(df1="red", df2="blue")
g = sns.FacetGrid(df, hue='df', palette=pal, height=5);  # `size` was renamed to `height` in seaborn 0.9
g.map(plt.scatter, "x", "y", s=50, alpha=.7, linewidth=.5, edgecolor="white")
g.map(sns.regplot, "x", "y", ci=None, robust=1)  # robust=1 requires statsmodels
g.add_legend();
This will yield this plot:
Which, if I understand correctly, is what you need.
Note that you will need to pay attention to the .regplot parameters and may want to change the values I have used as an example.
The ; at the end of a line suppresses the output of the command (I use an IPython notebook, where it's visible).
The docs give some explanation of the .map() method. In essence, it does just that: it maps a plotting command onto the data. However, it will work with 'low-level' plotting commands like regplot, and not lmplot, which actually calls regplot behind the scenes.
Normally plt.scatter would take the parameters c='none', edgecolor='r' to make non-filled markers. But seaborn interferes with the process and enforces a color on the markers, so I don't see an easy/straightforward way to fix this other than manipulating the ax elements after seaborn has produced the plot, which is best addressed as part of a different question.

Option 1: sns.regplot
In this case, the easiest to implement solution is to use sns.regplot, which is an axes-level function, because this will not require combining df1 and df2.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# create the figure and axes
fig, ax = plt.subplots(figsize=(6, 6))
# add the plots for each dataframe
sns.regplot(x='x1', y='y1', data=df1, fit_reg=True, ci=None, ax=ax, label='df1')
sns.regplot(x='x2', y='y2', data=df2, fit_reg=True, ci=None, ax=ax, label='df2')
ax.set(ylabel='y', xlabel='x')
ax.legend()
plt.show()
Option 2: sns.lmplot
As per the sns.FacetGrid documentation, it is better to use figure-level functions than to use FacetGrid directly.
Combine df1 and df2 into a long format, and then use sns.lmplot with the hue parameter.
When working with seaborn, it is almost always necessary for the data to be in a long format.
It's customary to use pandas.DataFrame.stack or pandas.melt to convert DataFrames from wide to long.
For this reason, df1 and df2 must have the columns renamed, and have an additional identifying column. This allows them to be concatenated on axis=0 (the default long format), instead of axis=1 (a wide format).
There are a number of ways to combine the DataFrames:
The combination method in the answer from Primer is fine if combining a few DataFrames.
However, a function, as shown below, is better for combining many DataFrames.
def fix_df(data: pd.DataFrame, name: str) -> pd.DataFrame:
    """rename columns and add a column"""
    # rename columns to a common name
    data.columns = ['x', 'y']
    # add an identifying value to use with hue
    data['df'] = name
    return data
# create a list of the dataframes
df_list = [df1, df2]
# update the dataframes by calling the function in a list comprehension
df_update_list = [fix_df(v, f'df{i}') for i, v in enumerate(df_list, 1)]
# combine the dataframes
df = pd.concat(df_update_list).reset_index(drop=True)
# plot the dataframe
sns.lmplot(data=df, x='x', y='y', hue='df', ci=None)
Notes
Package versions used for this answer:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4
