How do I plot ordered categorical data? - python

I have categorical data in a dataframe corresponding to the x-axis and the y-axis of a plot, and these categories are ordered (e.g. a < b < c < d). After failing to find a similar example on Stack Overflow of how to plot such data, I assigned ordered numbers to the categories (e.g. a, b, c, d, e were indexed as 4, 5, 6, 7, 8) and plotted the graph below
using the following lines of code:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(df3["LCST"], df3["LCST.1"])
lims = [
    np.min([ax.get_xlim(), ax.get_ylim()]),  # min of both axes
    np.max([ax.get_xlim(), ax.get_ylim()]),  # max of both axes
]
ax.plot(lims, lims, 'k-', alpha=0.75, zorder=0)  # y = x reference line
ax.set_xlabel('Actual', fontsize=18)
ax.set_ylabel('Estimated', fontsize=18)
ax.set_aspect('equal')
ax.set_xlim(lims)
ax.set_ylim(lims)
The issue is that this plot is not particularly informative, and I would prefer to have the actual categories "a, b, c, d, ..." on the axes rather than the bare numbers (besides, the half-numbers on the axes are misleading, as they do not correspond to any category). I tried the same exercise without converting the categories to numbers first, but of course the ordering was all off. I have come across many posts about ordering categorical data, but for some reason they are not particularly fruitful for my case. How can this be done? My data is of the following format (note the example below is just for the sake of illustration and does not correspond to the graph plotted above):
where, say, I want to produce the same graph as above, only with the categories on the axes, ordered such that CCC+ < B < A < A+. Any help would be very much appreciated. Thanks.
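For reference, one way to get ordered categories onto both axes (a sketch only, not from the original post; the column names and rating categories here are made up) is to convert the columns to ordered categoricals, plot their integer codes, and then relabel the ticks with the category names:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example data; real column names and categories will differ.
order = ["CCC+", "B", "A", "A+"]  # desired order, lowest to highest
df3 = pd.DataFrame({"Actual":    ["B", "A", "CCC+", "A+", "A"],
                    "Estimated": ["B", "A+", "B", "A", "A"]})

# Make both columns ordered categoricals sharing the same category list.
for col in ["Actual", "Estimated"]:
    df3[col] = pd.Categorical(df3[col], categories=order, ordered=True)

fig, ax = plt.subplots()
# Plot the integer codes (0, 1, 2, ...) so the ordering is respected.
ax.scatter(df3["Actual"].cat.codes, df3["Estimated"].cat.codes)
ticks = range(len(order))
ax.plot([0, len(order) - 1], [0, len(order) - 1], 'k-', alpha=0.75, zorder=0)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(order)  # category names instead of numbers
ax.set_yticklabels(order)
ax.set_xlabel('Actual', fontsize=18)
ax.set_ylabel('Estimated', fontsize=18)
ax.set_aspect('equal')
plt.show()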

Related

My x-axis is messed up for huge datasets?

I am trying to plot a countplot using the Seaborn library. The dataset is huge, with more than 100,000 entries and 67 columns. When I plot it, the x-axis gets messed up, and increasing the figure size of the plot does not help. My code and the resulting figure are as follows:
#We will see what is the status of columns that have null values or comprise of values that are zero
na = pd.DataFrame(df.isnull().sum())
plt.figure(figsize=(25,25))
sns.barplot(y=na[0],x=na.index)
plt.xlabel(xlabel=na.index)
plt.title("Columns with Null Values Distribution",size=20)
Any suggestion for plotting this so that the x-axis is clearer and easier to read would be helpful. Thank you for your help.
I have found the solution to this question: swap the x-axis and the y-axis so the bars are drawn horizontally. I have also amended my code to set a dpi. The code which gives me a proper visualisation is as follows:
#We will see what is the status of columns that have null values or comprise of values that are zero
na = pd.DataFrame(df.isnull().sum())
plt.figure(num=None, figsize=(20,18), dpi=80, facecolor='w', edgecolor='r')
sns.barplot(y=na.index,x=na[0])
#plt.xlabel(xlabel=na.index)
plt.title("Columns with Null Values Distribution",size=10)
The visualisation is as follows:

Python stacked barchart where y-axis scale is linear but the bar fill is logarithmic in the order of 10s

As the title explains, I am trying to reproduce a stacked barchart where the y-axis scale is linear but the inside fill of the plot (i.e. the stacked bars) is grouped logarithmically, in orders of 10.
I have made this plot before in RStudio with an in-house package; however, I am trying to reproduce it with other programs (Python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with to give you an idea of my data
I am trying to reproduce this following plot I made with R-Studio:
this one here
This plot has the dataset divided into groups based on their frequency, with the top 10 most frequent grouped in red, followed by the next top 100, the next 1000, etc. The y-axis has a 0.00-1.00 scale, but a 100% scale would mean the same thing in this context.
This is just to visualize whether I have big clones (the top 10) and how much of the overall dataset they occupy in frequency - i.e. the bigger the red stack, the larger the clones, signifying there has been a significant clonal expansion in my sample of a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
(MYDATAFRAME.groupby(['Sample', 'cloneFraction']).size()
            .groupby(level=0)
            .apply(lambda x: 100 * x / x.sum())  # counts to percentages per sample
            .unstack()
            .plot(kind='bar', stacked=True, legend=None))
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent clones aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can fix by sorting my dataframe by the column of interest).
Other than the axis messing up and not giving me a % when I use the log scale (which is a secondary issue), I can't figure out how to group the data entries by frequency as I described above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
This was just to see if I could separate the entries correctly, but it does not achieve what I would like (besides producing a pie chart, which I changed in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't get that working either: I get errors when I add it to the plotting code above, and if I use it without the other fancy stuff (% scale, etc.) the stacked barplot gets confused and just plots the top 10 out of all 4 samples in my data, rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, etc.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column holding the category each entry falls in, based on its value (i.e. whether it is in the top 10 most frequent, the next 100, etc.).
df['category'] = '10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index, 'category'] = '1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index, 'category'] = '101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index, 'category'] = '11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index, 'category'] = 'top10'
This code starts from the biggest group (10001+) and works down to smaller and smaller groups, so entries that also fall into a more specific (smaller) group are overwritten with that label.
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Sample','category']).sum()['cloneFraction'].unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!

Matplotlib plotting data that doesn't exist

I am trying to plot three lines on one figure. I have three years of data for three sites and I am simply trying to plot them with the same x-axis and the same y-axis. The first two lines span all three years of data, while the third dataset is usually sparser. Using the object-oriented matplotlib axes interface, when I try to plot my third set of data I get points at the end of the graph that are outside the range of that dataset. My third dataset is structured as tuples of dates and values, such as:
data=
[('2019-07-15', 30.6),
('2019-07-16', 20.88),
('2019-07-17', 16.94),
('2019-07-18', 11.99),
('2019-07-19', 13.76),
('2019-07-20', 16.97),
('2019-07-21', 19.9),
('2019-07-22', 25.56),
('2019-07-23', 18.59),
...
('2020-08-11', 8.33),
('2020-08-12', 10.06),
('2020-08-13', 12.21),
('2020-08-15', 6.94),
('2020-08-16', 5.51),
('2020-08-17', 6.98),
('2020-08-18', 6.17)]
where the data ends in August 2020, yet the graph includes points at the end of 2020. This happens with all my sites, while the first two datasets (knowndf['DATE'] and knowndf['Value'] below) stay constant.
Here is the problematic graph.
And here is what I have for the plotting:
fig, ax=plt.subplots(1,1,figsize=(15,12))
fig.tight_layout(pad=6)
ax.plot(knowndf['DATE'], knowndf['Value1'],'b',alpha=0.7)
ax.plot(knowndf['DATE'], knowndf['Value2'],color='red',alpha=0.7)
ax.plot(*zip(*data), 'g*', markersize=8) #when i plot this set of data i get nonexistent points
ax.tick_params(axis='x', rotation=45) #rotating for aesthetic
ax.set_xticks(ax.get_xticks()[::30]) #only want every 30th tick instead of every daily tick
I've tried ax.twinx(), but that gives me two y-axes, which doesn't help since I want to use the same x-axis and y-axis for all three sites. I've tried not using the axes approach, but there are things that come with axes that I need for this plot. Please help!
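Not part of the original post, but one thing worth checking, assuming knowndf['DATE'] holds real datetimes while data holds plain date strings: matplotlib treats strings as categorical tick positions, so the third series may not line up with the date axis of the other two lines. A minimal sketch of parsing the tuples into datetimes before plotting:
import pandas as pd

# Unzip the (date, value) tuples and parse the dates so all three series
# share the same datetime x-axis instead of mixing strings and dates.
dates, values = zip(*data)
ax.plot(pd.to_datetime(dates), values, 'g*', markersize=8)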

Seaborn: Violinplot experiences difficulty with too many variables?

I wanted to use seaborn to visualize my entire Pandas dataframe with violinplots, and I thought I had made the necessary adjustments to generate a large enough graph for the 270 variables my dataframe contains.
However, no matter what I do, the violinplots only display their inner mini-boxplots (as another question here describes) for each variable, and not their kde's:
fig, ax = plt.subplots(figsize=(50,5))
ax.set_ylim(-6, 6)
a = sns.violinplot(x='variable', y='value', data=pd.melt(train_norm), ax=ax)
a.set_xticklabels(a.get_xticklabels(), rotation=90);
plt.savefig('massive_violinplot.png', figsize=(50,5), dpi=220)
(apologies for the cropped graph, the whole thing is too big to post)
Whereas the following code, using the same pd.Dataframe, but only showing the first six variables, displays correctly:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_ylim(-6, 6)
a = sns.violinplot(x='variable', y='value', data=pd.melt(train_norm.iloc[:,:6]), ax=ax)
a.set_xticklabels(a.get_xticklabels(), rotation=90);
plt.savefig('massive_violinplot.png', figsize=(10,5), dpi=220)
How could I get a graph like the above for all the variables, filled with proper violinplots showing their kde's?
This is not related to the number of variables or the plot size but to the huge differences in the distributions of the variables. I can't access your data right now, so I will illustrate it with a made-up dataset. You can follow along with your own dataset by selecting the three variables with the most dispersion and the three with the least. As a dispersion measure you can use the variance, or even the data range (if you don't have crazy long tails), or something different; I am not sure what would work best.
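That selection step might look like the following sketch (not from the original answer; it assumes train_norm is the wide numeric DataFrame from the question and uses variance as the dispersion measure):
# Rank the columns by variance and keep the three least and three most dispersed.
variances = train_norm.var().sort_values()
subset = train_norm[list(variances.index[:3]) + list(variances.index[-3:])]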
rs = np.random.RandomState(42)
data = rs.randn(100, 6)
data[:, :3] *= 20  # give half of the columns a much larger spread
df = pd.DataFrame(data)
See what happens if we plot the density with common axes so they are directly comparable.
df.plot(kind='kde', subplots=True, layout=(3, 2), sharex=True, sharey=True)
plt.tight_layout()
This is more or less the same as what you can see in the seaborn violin plot, only transposed.
sns.violinplot(x='variable', y='value', data=pd.melt(df))
This is usually great for comparing the variables because you can read differences in width as differences in density. Unfortunately, the violins for the variables with more dispersion are so narrow that you can't see the width at all and you lose any sense of the shape. On the other hand, the variables with less dispersion appear too short (in your dataset some of them are just horizontal lines).
For the first problem you can make the violins use all the available horizontal space by using scale='width' but then you no longer can compare the density across variables. The width is the same at the peaks but the density is not.
sns.violinplot(x='variable', y='value', data=pd.melt(df), scale='width')
By the way, this is what matplotlib's violin plot does by default.
plt.violinplot(df.T)
For the second problem I think your only option is to normalize or standardize the variables in some way.
sns.violinplot(x='variable', y='value', data=pd.melt((df - df.mean()) / df.std()))
Now you have a clearer view of each variable separately (how many modes they have, how skewed they are, how long the tails are...) but you can compare neither the scale nor the dispersion across variables.
The moral of the story is that you can't see everything at once, you have to pick and choose depending on what you are looking for in the data.

Adding error bars to Matplotlib-generated graph of Pandas dataframe creates invalid legend

I am trying to graph a Pandas dataframe using Matplotlib. The dataframe contains four data columns composed of natural numbers, and an index of integers. I would like to produce a single plot with line graphs for each of the four columns, as well as error bars for each point. In addition, I would like to produce a legend providing labels for each of the four graphed lines.
Graphing the lines and legend without error bars works fine. When I introduce error bars, however, the legend becomes invalid -- the colours it uses no longer correspond to the appropriate lines. If you compare a graph with error bars and a graph without, the legend and the shapes/positions of the curves remain exactly the same. The colours of the curves get switched about, however, so that though the same four colours are used, they now correspond to different curves, meaning that the legend now assigns the wrong label to each curve.
My graphing code is thus:
def plot_normalized(agged, show_errorbars, filename):
    combined = {}
    # "agged" is a dictionary containing Pandas dataframes. Each dataframe
    # contains both a CPS_norm_mean and CPS_norm_std column. By running the code
    # below, the single dataframe "combined" is created, which has integer
    # indices and a column for each of the four CPS_norm_mean columns contained
    # in agged's four dataframes.
    for k in agged:
        combined[k] = agged[k]['CPS_norm_mean']
    combined = pandas.DataFrame(combined)
    plt.figure()
    combined.plot()
    if show_errorbars:
        for k in agged:
            plt.errorbar(
                x=agged[k].index,
                y=agged[k]['CPS_norm_mean'],
                yerr=agged[k]['CPS_norm_std']
            )
    plt.xlabel('Time')
    plt.ylabel('CPS/Absorbency')
    plt.title('CPS/Absorbency vs. Time')
    plt.savefig(filename)
The full 100-line script is available on GitHub. To run, download both graph.py and lux.csv, then run "python2 graph.py". It will generate two PNG files in your working directory -- one graph with error bars and one without.
The graphs are thus:
Correct graph (with no error bars):
Incorrect graph (with error bars):
Observe that the graph without error bars is properly labelled. In the graph with error bars the legend itself is identical, but the curves' colours have changed, so each legend entry now refers to a different (wrong) curve.
Thanks for any help you can provide. I've spent a number of extremely aggravating hours bashing my head against the wall, and I suspect that I'm making a stupid beginner's mistake. For what it's worth, I've tried with the Matplotlib development tree, version 1.2.0, and 1.1.0, and all three have exhibited identical behaviour.
I am new to programming and Python in general, but I managed to throw together a dirty fix: the legend labels are now correct, though the colours are not.
def plot_normalized(agged, show_errorbars, filename):
    combined = {}
    for k in agged:
        combined[k] = agged[k]['CPS_norm_mean']
    combined = pandas.DataFrame(combined)
    ax = combined.plot()
    if show_errorbars:
        for k in agged:
            plt.errorbar(
                x=agged[k].index,
                y=agged[k]['CPS_norm_mean'],
                yerr=agged[k]['CPS_norm_std'],
                label=k  # added
            )
    if show_errorbars:  # try this, dirty fix
        handles, labels = ax.get_legend_handles_labels()
        N = len(handles) // 2
        plt.legend(handles[:N], labels[N:])
    # Why does the fix work?
    # handles, labels = ax.get_legend_handles_labels()
    # print labels
    # out:
    # [u'Blank', u'H9A', u'Q180K', u'Wildtype', 'Q180K', 'H9A', 'Wildtype', 'Blank']
    # The right half is in the correct order; these are the labels from label=k in the
    # errorbar calls above, so pairing the first-half handles with them relabels the legend.
    plt.xlabel('Time')
    plt.ylabel('CPS/Absorbency')
    plt.title('CPS/Absorbency vs. Time')
    plt.savefig(filename)
Produces:
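Not from the original answer, but a possibly cleaner route (assuming a pandas version whose DataFrame.plot supports the yerr keyword) is to let pandas draw the error bars itself, so the line handles and legend labels never get out of sync. A sketch, reusing the agged dictionary from the question:
import pandas
import matplotlib.pyplot as plt

def plot_normalized_with_yerr(agged, show_errorbars, filename):
    # Hypothetical variant, not the original author's code.
    means = pandas.DataFrame({k: agged[k]['CPS_norm_mean'] for k in agged})
    stds = pandas.DataFrame({k: agged[k]['CPS_norm_std'] for k in agged})
    ax = means.plot(yerr=stds if show_errorbars else None)  # pandas draws the error bars
    ax.set_xlabel('Time')
    ax.set_ylabel('CPS/Absorbency')
    ax.set_title('CPS/Absorbency vs. Time')
    ax.get_figure().savefig(filename)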
