I have been working on a plot of a dataframe with about 2k values and their index labels. When I plot it, matplotlib tries to squeeze every index label onto the axis, which makes them impossible to read. It looks like this:
I have tried increasing figsize, but there are just too many xticks. Since the data is not linear (there are maybe 900 values between 0 and 1, 600 values between 1 and 100, and the rest are above 100), I cannot just rearrange the xticks with np.arange(), because the ticks lose their correspondence with the data (there are 2k bars, but the last index is 1534.34). If I try xticks=np.arange(0, max1, max1/20) to get 20 evenly spaced ticks, I get this result:
This is obviously wrong: as I said, the max index is 1534.34, so the last tick should be at the very end of the horizontal axis. Also, the first 900 values are between 0 and 1, and the plot does not reflect that.
I don't know if I've been clear enough or whether this question is adequate, but this is my first question and I have tried to be. So please don't be too harsh. All criticism is welcome, though. Thanks.
I use ax.xaxis.set_major_locator(ticker.MaxNLocator(interval)) to set the ticks on the x axis, after defining interval as some integer (the maximum number of intervals between ticks). See the MaxNLocator documentation. I think this relies on creating the figure and axes explicitly, in a form such as:
fig = plt.figure(figsize = (5,2))
ax = fig.add_axes([0, 0, 1, 1])
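For the situation in the question, where about 2k bars sit at unevenly spaced index values, one possible approach is to draw the bars at integer positions and label only a handful of them with the real index values. The following is a minimal sketch under assumptions: the data is synthetic, and the names s, positions and tick_pos are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's data: 2000 values on a non-linear index
rng = np.random.default_rng(0)
index = np.sort(np.concatenate([
    rng.uniform(0, 1, 900),
    rng.uniform(1, 100, 600),
    rng.uniform(100, 1534.34, 500),
]))
s = pd.Series(rng.random(2000), index=index)

fig, ax = plt.subplots(figsize=(10, 3))

# Draw one bar per row at integer positions 0..N-1
positions = np.arange(len(s))
ax.bar(positions, s.to_numpy(), width=1.0)

# Pick 20 evenly spaced bar positions and label them with the real index values
tick_pos = np.linspace(0, len(s) - 1, 20).round().astype(int)
ax.set_xticks(tick_pos)
ax.set_xticklabels([f"{s.index[i]:.2f}" for i in tick_pos], rotation=90)
```

The ticks are evenly spaced along the bars, not along the index values, so the last label is the largest index and sits at the last bar.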
As the title explains, I am trying to reproduce a stacked bar chart where the y-axis scale is linear but the fill of the plot (i.e. the stacked bars) is logarithmic and grouped by powers of 10.
I have made this plot before in RStudio with an in-house package, but I am trying to reproduce it with other programs (Python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with, to give you an idea of my data:
I am trying to reproduce the following plot, which I made in RStudio:
This plot has the dataset divided into groups based on their frequency, with the top 10 most frequent grouped in red, followed by the next 100, the next 1000, and so on. The y-axis has a 0.00-1.00 scale; a 0-100% scale would be equivalent, since they mean the same thing in this context.
This is just to get an idea and visualize if I have big clones (the top 10) and how much of the overall dataset they occupy in frequency - i.e. the bigger the red stack the larger clones I have, signifying there has been a significant clonal expansion in my sample of a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
MYDATAFRAME.groupby(['Sample','cloneFraction']).size().groupby(level=0).apply(lambda x: 100 * x / x.sum()).unstack().plot(kind='bar',stacked=True, legend=None)
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can just fix by sorting my dataframe by the column of interest).
Other than the axis messing up and not giving me a % when I use log scale (which is a secondary issue), I don't know how to group the data entries by frequency as I described above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
Just to see if I could separate them in a correct way but this does not achieve what I would like (other than being a pie chart, but I changed that in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't seem to get that working either, as I get errors when I try adding it to the code I've used above to plot my graph - and if I use it without the other fancy stuff in the code (% scale, etc) it gets confused in the stacked barplot and just plots the top 10 out of all the 4 samples in my data, rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, etc.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column with the category each entry falls in, based on its value (i.e. whether it is among the top 10 most frequent, the next 100, and so on).
df['category']='10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index,'category']='1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index,'category']='101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index,'category']='11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index,'category']='top10'
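This code starts from the biggest group (10001+) and works down to the smallest. Because the groups overlap (the top 10 are also in the top 100, and so on), each later assignment overwrites the earlier one, so every clone ends up in the smallest group it belongs to.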
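The same binning can probably be done without the loop, by ranking clones within each sample and cutting the ranks into bins. This is only a sketch, not the method used above: the data is synthetic, and rank, bins and labels are illustrative names. Ties are broken by row order here, which may differ slightly from how nlargest handles them.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two samples with 150 clones each
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sample_ref": np.repeat(["A", "B"], 150),
    "cloneCount": rng.integers(1, 10_000, size=300),
})

# Rank clones within each sample; rank 1 is the largest cloneCount
rank = df.groupby("sample_ref")["cloneCount"].rank(method="first", ascending=False)

# Right-closed bins: ranks 1-10, 11-100, 101-1000, 1001-10000, above 10000
bins = [0, 10, 100, 1000, 10000, np.inf]
labels = ["top10", "11-100", "101-1000", "1001-10000", "10001+"]
df["category"] = pd.cut(rank, bins=bins, labels=labels)
```

Each row gets exactly one label, so no overwriting order has to be maintained.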
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Sample','category'])['cloneFraction'].sum().unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!
I'm looking to set a specific range of tick values on the x-axis of my plot and to increase the length of this axis.
I changed the range of the x-axis values; however, the tick values stay at a fixed spacing.
I also tried to increase the length of the x-axis, but that failed too.
For now I'm only plotting an empty graph, because I need to set the specifications for the axes first.
Here is the part of the code that creates the plot:
fig1, ax = plt.subplots()
ax.set_xlim(1, 1200)
ax.set_ylim(-800, 200)
ax.set_box_aspect(1)
plt.show()
This code gives me a square plot with x-axis ticks at:
x-axis = 0-200-400...1200,
I'm looking for:
x-axis = 0-50-100-150...1200
Also, I need to change the shape of the plot from a square to a rectangle, so that the x-axis is longer.
Any suggestion or comment is welcome!
Thanks!
plt.figure(figsize=(15,2))
Put this on the first line to set the size of your plot. Since you want a longer x-axis, make sure the width is larger than the height in the figsize parameter.
import numpy as np
l1 = np.arange(0, 1250, 50)
plt.xticks(l1)
Use the above code after setting the y limits to place the xticks from 0 to 1200 in steps of 50.
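Putting both pieces together with the object-oriented interface from the question might look like the sketch below. It assumes the x limits start at 0 rather than 1 so that the first tick is visible, and it leaves out ax.set_box_aspect(1), since that call forces the axes box to be square regardless of figsize.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# A wide figure gives a rectangular plot with a longer x-axis
fig, ax = plt.subplots(figsize=(15, 4))

ax.set_xlim(0, 1200)
ax.set_ylim(-800, 200)

# Ticks at 0, 50, 100, ..., 1200
ax.set_xticks(np.arange(0, 1250, 50))

# Rotate the labels so 25 of them do not overlap
ax.tick_params(axis="x", labelrotation=90)
```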
You can change the size (and therefore the shape) of a pyplot figure like this:
fig1.set_size_inches(10, 8)
As for the ticks on the axis, this post gives a pretty in-depth answer on how to customize those.
I have a list of times I want to use with plotly to make a line graph. The Y values work perfectly, but the x's are just spaced out by 1. I would like to make them be spaced out by 5 instead, but I don't know how to do that. I saw here that you can use a range, but that range seems to be whole numbers and an interval of 1, so how do I get the x to increase by 5 for every Y value? I can't put a for loop here to do that, so I'm not sure how this would work. Any help? Thanks!
I'm not clear on what you want to achieve, but if you only want a specific spacing between x axis labels you can use tick0 and dtick for that purpose:
xaxis=dict(
    tick0=0,
    dtick=5,
    ...
),
It would help if you showed what you have and what you want, for example.
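If the goal is instead for the x values themselves to increase by 5 for every y value, they can be generated from the length of y. The sketch below uses made-up y values and builds the trace and layout as plain dictionaries; the commented lines at the end show how they would be handed to plotly, assuming it is installed.

```python
# Hypothetical y values standing in for the asker's data
y = [3.2, 4.1, 3.8, 5.0, 4.6]

# One x value per y value, increasing by 5: 0, 5, 10, ...
x = list(range(0, 5 * len(y), 5))

# Trace and layout as plain dicts; tick0/dtick put a tick label on every x value
trace = dict(type="scatter", mode="lines", x=x, y=y)
layout = dict(xaxis=dict(tick0=0, dtick=5))

# With plotly installed, the dicts can be passed straight through:
#   import plotly.graph_objects as go
#   go.Figure(data=[trace], layout=layout).show()
```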
I tried searching for something similar, and the closest thing I could find was this which helped me to extract and manipulate the data, but now I can't figure out how to re-plot the histogram. I have some array of voltages, and I have first plotted a histogram of occurrences of those voltages. I want to instead make a histogram of events per hour ( so the y-axis of a normal histogram divided by the number of hours I took data ) and then re-plot the histogram with the manipulated y data.
I have an array which contains the number of events per hour ( composed of the original y axis from pyplot.hist divided by the number of hours data was taken ), and the bins from the histogram. I have composed that array using the following code ( taken from the answer linked above ):
import numpy
import matplotlib.pyplot as pyplot
mydata = numpy.random.normal(-15, 1, 500) # this seems to have to be 'uneven' on either side of 0, otherwise the code looks fine. FYI, my actual data is all positive
pyplot.figure(1)
hist1 = pyplot.hist(mydata, bins=50, alpha=0.5, label='set 1', color='red')
hist1_flux = [hist1[0]/5.0, 0.5*(hist1[1][1:]+hist1[1][:-1])]
pyplot.figure(2)
pyplot.bar(hist1_flux[1], hist1_flux[0])
This code doesn't exactly match what's going on in my code; my data is composed of 1000 arrays of 1000 data points each ( voltages ). I have made histograms of that, which gives me number of occurrences of a given voltage range ( or bin width ). All I want to do is re-plot a histogram of the number of events per hour (so yaxis of the histogram / 5 hours) with the same original bin width, but when I divide hist1[0]/5 and replot in the above way, the 'bin width' is all wrong.
I feel like there must be an easier way to do this, rather than manually replotting my own histograms.
Thanks in advance, and I'm really sorry if I've missed something obvious.
The problem, illustrated in the output of my sample code AND my original data is as follows:
Upper plots: code snippet output.
Lower plots: My actual data.
It's because the bar function takes an argument width, which is by default 0.8 (plt.bar(left, height, width=0.8, bottom=None, hold=None, **kwargs)), so you need to set it to the bin width, i.e. the distance between two adjacent bar centers:
pyplot.bar(hist1_flux[1], hist1_flux[0],
width=hist1_flux[1][1] - hist1_flux[1][0])
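An alternative that avoids re-plotting by hand is to let hist do the division through its weights argument: if every data point carries a weight of 1/hours, the bar heights come out as events per hour with the original bins. A minimal sketch, assuming 5 hours of data and the same synthetic sample as in the question:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop this line in a notebook
import matplotlib.pyplot as pyplot
import numpy

hours = 5.0  # assumed duration of data taking
rng = numpy.random.default_rng(0)
mydata = rng.normal(-15, 1, 500)

# Each event counts as 1/hours, so bin heights are events per hour
weights = numpy.full(mydata.shape, 1.0 / hours)

fig, ax = pyplot.subplots()
rates, edges, patches = ax.hist(mydata, bins=50, weights=weights,
                                alpha=0.5, label='set 1', color='red')
ax.set_ylabel('events per hour')
```

Because hist draws the bars itself, the bin widths stay consistent with the bin edges.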
I am plotting some columns of a csv using Pandas/Matplotlib. The index column is the time in seconds (which takes very large values).
For example:
401287629.8
401287630.8
401287631.7
401287632.8
401287633.8
401287634.8
I need these values to be printed as my xticklabels when I plot. But the number format gets changed, as shown below:
plt.figure()
ax = dfPlot.plot()
legend = ax.legend(loc='center left', bbox_to_anchor=(1,0.5))
labels = ax.get_xticklabels()
for label in labels:
    label.set_rotation(45)
    label.set_fontsize(10)
I couldn't find a way for the xticklabel to print the exact value rather than shortened version of it.
This is essentially the same problem as "How to remove relative shift in matplotlib axis".
The solution is to tell the formatter not to use an offset:
ax.get_xaxis().get_major_formatter().set_useOffset(False)
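A self-contained way to try this out is sketched below with plain matplotlib and made-up values near the ones in the question. It uses ax.ticklabel_format, which sets the same option on the default ScalarFormatter; style="plain" additionally turns off scientific notation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up timestamps in seconds, close to the values in the question
t = 401287629.8 + np.arange(6)
values = np.arange(6)

fig, ax = plt.subplots()
ax.plot(t, values)

# Show full tick values instead of a small number plus an offset
ax.ticklabel_format(axis="x", style="plain", useOffset=False)

for label in ax.get_xticklabels():
    label.set_rotation(45)
    label.set_fontsize(10)
```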
Also related:
useOffset=False in config file?
https://github.com/matplotlib/matplotlib/issues/2400
https://github.com/matplotlib/matplotlib/pull/2401
If it's not rude of me to point out, you're asking for a great deal of precision from a single chart. Your sample data shows a five-second difference between two times that are both over twelve and a half years long.
You have to cut your cloth to your measure on this one. If you want to keep the years, you can't keep the seconds. If you want to keep the seconds, you can't have the years.