So I am working with some data for a science fair project, and I am extremely new to pandas and matplotlib/pyplot. I am currently trying to make a graph of some data (a bar graph) and have been able to do so fine. I split my DataFrame into two parts: the name and the values themselves:
data = pd.read_csv('results.csv')
data = data.sort_values(by=['Accuracy'], ascending=False)
accuracy = data['Accuracy']
names = data['Name']
This works fine. And when I go to make my graph it also works fine:
plt.bar(names, accuracy)
plt.title('Accuracy Below 97%')
plt.ylabel('Accuracy in Percent')
plt.show()
But the only problem is that when I do this, my names are too long so it ends up as a sort of blur:
I also have around 40 data points, which I understand is probably too many to be able to see the names anyway, but the names are around 30 characters long, so even if I reduced the number of data points in a graph, it would probably still not work.
So I then just assumed that I could remove names from plt.bar(names, accuracy), but this throws the error:
TypeError: bar() missing 1 required positional argument: 'height'
So I realized that I need a width value, and since the number of data points was 42, I then tried:
plt.bar(42, accuracy)
But this creates a weird graph that I am not looking for:
So my question is: how do I remove the names from the graph while keeping the actual graph the same?
Any help is greatly appreciated. Thanks!
If you want to remove the x-tick labels from the graph, you can do:
plt.xticks([])
You can also set the x-axis limits explicitly, so the bars still fill the plot once the labels are gone:
plt.xticks([])
plt.xlim(-0.5, len(accuracy)-0.5)
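A minimal sketch of this approach, using made-up names and accuracies in place of results.csv (the data below is an assumption for illustration only). Passing integer positions as the first argument of bar keeps one bar per row, and clearing the x-ticks then removes the names:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for results.csv
data = pd.DataFrame({
    "Name": [f"a_rather_long_model_name_number_{i:02d}" for i in range(42)],
    "Accuracy": [96.5 - 0.5 * i for i in range(42)],
})
data = data.sort_values(by="Accuracy", ascending=False)
accuracy = data["Accuracy"]

fig, ax = plt.subplots()
positions = range(len(accuracy))        # one x position per bar: 0, 1, ..., 41
bars = ax.bar(positions, accuracy)      # first argument is x, second is height
ax.set_xticks([])                       # no ticks, so no labels
ax.set_xlim(-0.5, len(accuracy) - 0.5)  # trim the empty margin at both ends
ax.set_title("Accuracy Below 97%")
ax.set_ylabel("Accuracy in Percent")
```

By contrast, plt.bar(42, accuracy) places every bar at x = 42, so all the bars are drawn on top of each other in a single column.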
The code below does what you asked, but instead of hiding the labels you could fix the overlap itself; see the question "datetime x-axis matplotlib labels causing uncontrolled overlap".
ax = data[['Accuracy','Name']].plot(title='Accuracy Below 97%')
ax.get_xaxis().set_visible(False)
plt.show()
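If keeping the names is preferable to hiding them, one common option is to rotate the tick labels so they no longer run into each other. The sketch below assumes a smaller set of made-up names and values; the figure size is an arbitrary choice:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical data: ten long names with made-up accuracies
names = [f"a_rather_long_model_name_number_{i:02d}" for i in range(10)]
accuracy = [96.5 - 0.5 * i for i in range(10)]

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(names, accuracy)
ax.tick_params(axis="x", labelrotation=90)  # draw the names vertically
ax.set_ylabel("Accuracy in Percent")
fig.tight_layout()                          # make room for the rotated labels
```

With around 40 bars this may still be cramped, in which case splitting the data across several figures is worth considering.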
Related
I'm new to coding and this is my first post. Sorry if it could be worded better!
I'm taking a free online course, and for one of the projects I have to make a count plot with 2 subplot columns.
I've managed to make a count plot with multiple subplots using the code below, and all of the values are correct.
fig = sns.catplot(x = 'variable', hue = 'value', order = ['active', 'alco', 'cholesterol', 'gluc', 'overweight', 'smoke'], col='cardio', data = df_cat, kind = 'count')
But because of the way I've done it, the fig.axes is stored in a 2 dimensional array. The only difference between both rows of the array is the title (cardio = 0 or cardio = 1). I'm assuming this is because of the col='cardio'. Does the col argument always cause the fig.axes to be stored in a 2D array? Is there a way around this or do I have to completely change how I'm making my graph?
I'm sure it's not usually a problem, but because of this, when I run my program through the test module, it fails since some of the functions in the test module don't work on numpy.ndarrays.
I pass the test if I change the reference from fig.axes[0] to fig.axes[0,0], but obviously I can't just change the test module to pass.
I found something. This is just an implementation detail, so it would be nuts to rely on it. If you set col_wrap, then you get an axes ndarray of a different shape.
Reproduced like this:
import seaborn as sns
# I don't have your data but I have this example
tips = sns.load_dataset("tips")
fig = sns.catplot(x='day', hue='sex', col='time', data=tips, kind='count', col_wrap=2)
fig.axes.shape
And it has shape (2,), i.e. it's 1D (seaborn==0.11.2).
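A way to sidestep the array shape entirely, sketched here with a small made-up DataFrame (an assumption; the real df_cat is not shown), is to either flatten the grid's axes or go through the underlying matplotlib Figure, whose axes attribute is a plain list. The figure attribute assumes seaborn 0.11.2 or later:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for df_cat
df = pd.DataFrame({
    "variable": ["smoke", "alco", "smoke", "alco", "smoke", "alco"],
    "value": [0, 1, 1, 0, 0, 1],
    "cardio": [0, 0, 1, 1, 0, 1],
})

g = sns.catplot(x="variable", hue="value", col="cardio", data=df, kind="count")

grid_shape = g.axes.shape      # 2D array of Axes when col_wrap is not set
flat_axes = list(g.axes.flat)  # works whether the array is 1D or 2D
figure_axes = g.figure.axes    # plain Python list held by the matplotlib Figure
```

Note that catplot returns a FacetGrid rather than a Figure, so code that expects a Figure would need g.figure instead of g.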
As the title explains, I am trying to reproduce a stacked barchart where the y-axis scale is linear but the inside fill of the plot (i.e. the stacked bars) are logarithmic and grouped in the order of 10s.
I have made this plot before on R-Studio with an in-house package, however I am trying to reproduce the plot with other programs (python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with to give you an idea of my data
I am trying to reproduce this following plot I made with R-Studio:
this one here
This plot has the dataset divided into groups based on their frequency, with the top 10 most frequent grouped in red, followed by the next 100, the next 1000, and so on. The y-axis has a 0.00-1.00 scale, but a 0-100% scale would work equally well; they mean the same thing in this context.
This is just to get an idea and visualize if I have big clones (the top 10) and how much of the overall dataset they occupy in frequency - i.e. the bigger the red stack the larger clones I have, signifying there has been a significant clonal expansion in my sample of a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
MYDATAFRAME.groupby(['Sample','cloneFraction']).size().groupby(level=0).apply(lambda x: 100 * x / x.sum()).unstack().plot(kind='bar',stacked=True, legend=None)
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can just fix by sorting my dataframe by the column of interest).
Other than the axis messing up and not giving me a % when I use log scale (which is a secondary issue), I don't know how to group the data entries by frequency as I described above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
Just to see if I could separate them in a correct way but this does not achieve what I would like (other than being a pie chart, but I changed that in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't seem to get that working either, as I get errors when I try adding it to the code I've used above to plot my graph - and if I use it without the other fancy stuff in the code (% scale, etc) it gets confused in the stacked barplot and just plots the top 10 out of all the 4 samples in my data, rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, etc.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column with the category my samples fall in, based on their value (i.e. whether they're in the top 10 most frequent, the next 100, and so on).
df['category'] = '10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index, 'category'] = '1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index, 'category'] = '101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index, 'category'] = '11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index, 'category'] = 'top10'
This code starts with the largest group (10001+) and works down to the smaller ones, so each narrower group overwrites the label of the clones it shares with the wider group before it.
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Sample','category']).sum()['cloneFraction'].unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!
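The same binning can also be sketched without the repeated nlargest calls, by ranking clones within each sample and cutting the rank into intervals. This is an alternative under stated assumptions, not the code used above: the data is randomly generated, and the column names Sample, cloneCount and cloneFraction are borrowed from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two samples with 150 clones each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sample": np.repeat(["A", "B"], 150),
    "cloneCount": rng.integers(1, 10_000, size=300),
})
df["cloneFraction"] = df["cloneCount"] / df.groupby("Sample")["cloneCount"].transform("sum")

# Rank 1 is the largest clone within each sample; ties are broken by row order
rank = df.groupby("Sample")["cloneCount"].rank(method="first", ascending=False)

# Intervals are right-closed: (0, 10], (10, 100], (100, 1000], ...
bins = [0, 10, 100, 1000, 10000, np.inf]
labels = ["top10", "11-100", "101-1000", "1001-10000", "10001+"]
df["category"] = pd.cut(rank, bins=bins, labels=labels)

# One row per sample, one column per category, ready for a stacked bar plot
stacked = df.groupby(["Sample", "category"], observed=True)["cloneFraction"].sum().unstack()
```

The resulting table can be passed to .plot(kind="bar", stacked=True) in the same way as in the plotting code above.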
I am trying to plot this DataFrame which records various amounts of money over a yearly series:
from matplotlib.dates import date2num
jp = pd.DataFrame([1000,2000,2500,3000,3250,3750,4500], index=['2011','2012','2013','2014','2015','2016','2017'])
jp.index = pd.to_datetime(jp.index, format='%Y')
jp.columns = ['Money']
I would simply like to make a bar graph out of this using PyPlot (i.e pyplot.bar).
I tried:
plt.figure(figsize=(15,5))
xvals = date2num(jp.index.date)
yvals = jp['Money']
plt.bar(xvals, yvals, color='black')
ax = plt.gca()
ax.xaxis_date()
plt.show()
But the chart turns out like this:
Only by increasing the width substantially will I start seeing the bars. I have a feeling that this graph is attributing the data to the first date of the year (2011-01-01 for example), hence the massive space between each 'bar' and the thinness of the bars.
How can I plot this properly, knowing that this is a yearly series? Ideally the x-axis would contain only the years. Something tells me that I do not need to use date2num(), since this seems like a very common, ordinary plotting exercise.
My guess as to where I'm stuck is not handling the year correctly. As of now I have them as a DatetimeIndex, but maybe there are other steps I need to take.
This has puzzled me for 2 days. All the solutions I found online seem to use DataFrame.plot, but I would rather learn how to use PyPlot properly. I also intend to add two more sets of bars, and it seems like the most common way to do that is through plt.bar().
Thanks everyone.
You can either do
jp.plot.bar()
which gives:
or plot against the actual years:
plt.bar(jp.index.year, jp.Money)
which gives:
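The bars in the original attempt are thin because a matplotlib date axis is measured in days: the default bar width of 0.8 is less than one day, while the bars sit a year apart. Plotting against integer years avoids this. The sketch below also shows one way further sets of bars might be added by offsetting the x positions; the second column is made up purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

jp = pd.DataFrame(
    {"Money": [1000, 2000, 2500, 3000, 3250, 3750, 4500]},
    index=pd.to_datetime(["2011", "2012", "2013", "2014", "2015", "2016", "2017"], format="%Y"),
)
jp["Other"] = jp["Money"] * 0.5  # hypothetical second series

years = jp.index.year.to_numpy()  # plain integers: 2011, 2012, ...
width = 0.4                       # in axis units, i.e. fractions of a year

fig, ax = plt.subplots(figsize=(15, 5))
ax.bar(years - width / 2, jp["Money"], width=width, color="black", label="Money")
ax.bar(years + width / 2, jp["Other"], width=width, color="gray", label="Other (hypothetical)")
ax.set_xticks(years)              # one tick per year, no fractional ticks
ax.legend()
```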
I am plotting some columns of a CSV file using Pandas/Matplotlib. The index column is the time in seconds (which holds very large numbers).
For example:
401287629.8
401287630.8
401287631.7
401287632.8
401287633.8
401287634.8
I need this to be printed as my xticklabel when I plot, but the number format gets changed, as shown below:
plt.figure()
ax = dfPlot.plot()
legend = ax.legend(loc='center left', bbox_to_anchor=(1,0.5))
labels = ax.get_xticklabels()
for label in labels:
    label.set_rotation(45)
    label.set_fontsize(10)
I couldn't find a way for the xticklabel to print the exact value rather than shortened version of it.
This is essentially the same problem as "How to remove relative shift in matplotlib axis".
The solution is to tell the formatter not to use an offset:
ax.get_xaxis().get_major_formatter().set_useOffset(False)
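A self-contained sketch of the same idea, using the timestamps from the question and a made-up data column (the column name and values are assumptions). It uses ticklabel_format, which switches off both the offset and scientific notation on the default formatter in one call:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

times = [401287629.8, 401287630.8, 401287631.7, 401287632.8, 401287633.8, 401287634.8]
dfPlot = pd.DataFrame({"signal": [1.0, 1.5, 1.2, 1.8, 1.6, 2.0]}, index=times)  # hypothetical values

fig, ax = plt.subplots()
ax.plot(dfPlot.index, dfPlot["signal"])

# Print tick values in full: no "+4.0128763e8" offset and no 1e8 multiplier
ax.ticklabel_format(axis="x", style="plain", useOffset=False)

for label in ax.get_xticklabels():
    label.set_rotation(45)
    label.set_fontsize(10)

fig.canvas.draw()  # tick label text is only filled in once the figure is drawn
tick_texts = [t.get_text() for t in ax.get_xticklabels()]
```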
Also related:
useOffset=False in config file?
https://github.com/matplotlib/matplotlib/issues/2400
https://github.com/matplotlib/matplotlib/pull/2401
If it's not rude of me to point out, you're asking for a great deal of precision from a single chart. Your sample data shows a six-second difference between two times that are both more than twelve and a half years in length.
You have to cut your cloth to your measure on this one. If you want to keep the years, you can't keep the seconds. If you want to keep the seconds, you can't have the years.
I am trying to graph a Pandas dataframe using Matplotlib. The dataframe contains four data columns composed of natural numbers, and an index of integers. I would like to produce a single plot with line graphs for each of the four columns, as well as error bars for each point. In addition, I would like to produce a legend providing labels for each of the four graphed lines.
Graphing the lines and legend without error bars works fine. When I introduce error bars, however, the legend becomes invalid -- the colours it uses no longer correspond to the appropriate lines. If you compare a graph with error bars and a graph without, the legend and the shapes/positions of the curves remain exactly the same. The colours of the curves get switched about, however, so that though the same four colours are used, they now correspond to different curves, meaning that the legend now assigns the wrong label to each curve.
My graphing code is thus:
def plot_normalized(agged, show_errorbars, filename):
    combined = {}
    # "agged" is a dictionary containing Pandas dataframes. Each dataframe
    # contains both a CPS_norm_mean and CPS_norm_std column. By running the code
    # below, the single dataframe "combined" is created, which has integer
    # indices and a column for each of the four CPS_norm_mean columns contained
    # in agged's four dataframes.
    for k in agged:
        combined[k] = agged[k]['CPS_norm_mean']
    combined = pandas.DataFrame(combined)
    plt.figure()
    combined.plot()
    if show_errorbars:
        for k in agged:
            plt.errorbar(
                x=agged[k].index,
                y=agged[k]['CPS_norm_mean'],
                yerr=agged[k]['CPS_norm_std']
            )
    plt.xlabel('Time')
    plt.ylabel('CPS/Absorbency')
    plt.title('CPS/Absorbency vs. Time')
    plt.savefig(filename)
The full 100-line script is available on GitHub. To run, download both graph.py and lux.csv, then run "python2 graph.py". It will generate two PNG files in your working directory -- one graph with error bars and one without.
The graphs are thus:
Correct graph (with no error bars):
Incorrect graph (with error bars):
Observe that the graph without error bars is properly labelled, while the graph with error bars is not: although the legend is identical, the changed colours of the line graphs mean that each legend entry now refers to a different (wrong) curve.
Thanks for any help you can provide. I've spent a number of extremely aggravating hours bashing my head against the wall, and I suspect that I'm making a stupid beginner's mistake. For what it's worth, I've tried with the Matplotlib development tree, version 1.2.0, and 1.1.0, and all three have exhibited identical behaviour.
I am new to programming and Python in general, but I managed to throw together a dirty fix: the legend labels are now correct, but the colors are not.
def plot_normalized(agged, show_errorbars, filename):
    combined = {}
    for k in agged:
        combined[k] = agged[k]['CPS_norm_mean']
    combined = pandas.DataFrame(combined)
    ax = combined.plot()
    if show_errorbars:
        for k in agged:
            plt.errorbar(
                x=agged[k].index,
                y=agged[k]['CPS_norm_mean'],
                yerr=agged[k]['CPS_norm_std'],
                label=k  # added
            )
    if show_errorbars:  # try this, dirty fix
        # get_legend_handles_labels() returns (handles, labels), in that order
        handles, labels = ax.get_legend_handles_labels()
        N = len(handles) // 2
        # Pair the first half of the handles with the second half of the labels
        plt.legend(handles[:N], labels[N:])
        # Why does the fix work?
        # print(labels)
        # out:
        # [u'Blank', u'H9A', u'Q180K', u'Wildtype', 'Q180K', 'H9A', 'Wildtype', 'Blank']
        # The right half has the correct order; these are the labels from
        # label=k in the errorbar calls above.
    plt.xlabel('Time')
    plt.ylabel('CPS/Absorbency')
    plt.title('CPS/Absorbency vs. Time')
    plt.savefig(filename)
Produces:
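Separately from the fix above, one way to keep colours and legend entries in step is to draw each curve only once, with errorbar itself, so that the line, its error bars and its legend entry all come from the same call. This is a sketch under assumptions: the agged dictionary below is filled with made-up numbers, and only the key names are taken from the printed labels above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "agged" dictionary of dataframes
rng = np.random.default_rng(0)
agged = {
    name: pd.DataFrame({
        "CPS_norm_mean": rng.random(5) + offset,
        "CPS_norm_std": np.full(5, 0.1),
    })
    for offset, name in enumerate(["Q180K", "H9A", "Wildtype", "Blank"])
}

fig, ax = plt.subplots()
for name, frame in agged.items():
    # One call draws the curve and its error bars in a single colour
    ax.errorbar(
        frame.index,
        frame["CPS_norm_mean"],
        yerr=frame["CPS_norm_std"],
        label=name,
    )
ax.legend()
ax.set_xlabel("Time")
ax.set_ylabel("CPS/Absorbency")
ax.set_title("CPS/Absorbency vs. Time")
```

Because nothing is plotted twice, there is no second set of artists whose order or colours could disagree with the legend.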