I want to plot a graph with a certain condition without manipulating my data frame.
For example, I created a countplot from a data frame in which many x-values have a count below 100. In the countplot those categories show up as "no-bar" and still take up space, so I just want to get rid of the empty ones (count < 100).
I tried to create another data frame with only count values higher than 100, but I wanted to know if there is a simpler/cleaner way to plot a graph, rather than creating a whole data frame.
plt.figure(figsize=(10,50))
plt.ylim(100,500)
sns.countplot(data=df, x='brand')
With this code I see many empty bars for categories whose count is below 100, because ylim is set to 100-500.
import matplotlib.pyplot as plt
import seaborn as sns
plot_data = df.groupby('brand').size().reset_index(name='count').query('count>=100')
plt.figure(figsize=(10,50))
plt.ylim(100,500)
sns.barplot(data=plot_data, x='brand', y='count')
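If the main goal is to avoid keeping a second data frame around, the counts can also be filtered as a Series. This is only a minimal sketch: the toy data is made up, the `brand` column name is taken from the question, and the headless `Agg` backend is an assumption so that it runs without a display. The filtered index could presumably also be passed to `sns.countplot` through its `order` parameter so that only those categories are drawn.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: brand "a" appears 150 times, "b" 120 times, "c" only 5 times
df = pd.DataFrame({"brand": ["a"] * 150 + ["b"] * 120 + ["c"] * 5})

counts = df["brand"].value_counts()   # Series mapping brand -> number of rows
keep = counts[counts >= 100]          # drop brands with fewer than 100 rows

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(keep.index), keep.values)  # one bar per remaining brand
ax.set_xlabel("brand")
ax.set_ylabel("count")
```

Only the brands that pass the threshold reach the plot, so no empty slots are left on the x-axis.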
As the title explains, I am trying to reproduce a stacked bar chart where the y-axis scale is linear but the inside fill of the plot (i.e. the stacked bars) is logarithmic and grouped in powers of 10.
I have made this plot before on R-Studio with an in-house package, however I am trying to reproduce the plot with other programs (python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with to give you an idea of my data
I am trying to reproduce the following plot, which I made with R-Studio:
[Figure: stacked bar chart made with R-Studio]
This plot has the dataset divided into groups based on frequency, with the top 10 most frequent clones grouped in red, followed by the next 100, the next 1000, and so on. The y-axis uses a 0.00-1.00 scale; a 0-100% scale would work equally well, since they mean the same thing in this context.
This is just to get an idea and visualize if I have big clones (the top 10) and how much of the overall dataset they occupy in frequency - i.e. the bigger the red stack the larger clones I have, signifying there has been a significant clonal expansion in my sample of a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
MYDATAFRAME.groupby(['Sample','cloneFraction']).size().groupby(level=0).apply(lambda x: 100 * x / x.sum()).unstack().plot(kind='bar',stacked=True, legend=None)
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can just fix by sorting my dataframe by the column of interest).
Other than the axis messing up and not giving me a % when I use log scale (which is a secondary issue), I don't know how to group the data entries by frequency as I described above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
Just to see if I could separate them in a correct way but this does not achieve what I would like (other than being a pie chart, but I changed that in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't seem to get that working either, as I get errors when I try adding it to the code I've used above to plot my graph - and if I use it without the other fancy stuff in the code (% scale, etc) it gets confused in the stacked barplot and just plots the top 10 out of all the 4 samples in my data, rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, etc.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column with the category my samples fall in, based on their value (i.e. whether they're in the top 10 most frequent, the next 100, etc.).
df['category']='10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index,'category']='1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index,'category']='101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index,'category']='11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index,'category']='top10'
This code starts from the biggest group (10001+) and works down to smaller and smaller groups, so that each narrower label overwrites the broader one for clones that fall into both.
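The same labelling can probably be done without the explicit loop by ranking the clones within each sample and binning the ranks. The sketch below is one possible way, not the method used above: the toy data is invented, and grouping on a `Sample` column with a `cloneCount` column is an assumption based on the column names used in this thread.

```python
import numpy as np
import pandas as pd

# Invented data: two samples with 150 clones each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sample": np.repeat(["A", "B"], 150),
    "cloneCount": rng.integers(1, 10_000, size=300),
})

# Rank clones inside each sample: 1 = largest cloneCount, ties broken by order
rank = df.groupby("Sample")["cloneCount"].rank(method="first", ascending=False)

# Bin the ranks into the same groups as above: (0, 10], (10, 100], ...
bins = [0, 10, 100, 1000, 10000, np.inf]
labels = ["top10", "11-100", "101-1000", "1001-10000", "10001+"]
df["category"] = pd.cut(rank, bins=bins, labels=labels)
```

Because the ranks are computed per sample, every sample gets its own top 10, next 90, and so on, which is what the loop achieves by filtering on one sample at a time.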
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
# Select the numeric column before summing so non-numeric columns are not aggregated
df.groupby(['Sample','category'])['cloneFraction'].sum().unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!
I'm currently trying to plot 7 days with varying small to large numbers.
The first set of data may look like this
dates = ['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23', '2018-09-24', '2018-09-25', '2018-09-26', '2018-09-27']
values = [107.660514, 107.550403, 107.435041, 107.435003, 107.574965, 107.449961, 107.650052, 107.649974]
whereas another set of data may have the same dates but values with much smaller incremental changes
dates = ['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23', '2018-09-24', '2018-09-25', '2018-09-26', '2018-09-27']
values = [0.849215, 0.849655, 0.849655, 0.851095, 0.850885, 0.850135, 0.851203, 0.851865]
When I use this
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
plt.plot_date(x=dates, y=values, fmt="r--")
plt.ylabel(c)
plt.grid(True)
plt.savefig('static/%s.png' % c)
The resulting image for the 1st set of values comes out as a dashed line connecting the dots for each day. But the 2nd set of data produces an image of 7 parallel lines stacked on top of each other.
Should I be plotting this differently?
I assume you would like a comparison between the two sets of data you provided.
However, with such a gap between the two sets, the result could be fairly unclear if you show both in the same plot.
You could use plt.subplots() to do that, and you'll probably get a plot like this
A better option is to show the two plots separately, which gives a much clearer picture.
If you want to just show two plots, you can do something like this.
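A minimal sketch of the two-panel idea, assuming the dates and values from the question. The conversion of the date strings with `datetime.strptime`, the use of `ax.plot` instead of `plot_date`, and the headless `Agg` backend are choices made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, e.g. on a web server
import matplotlib.pyplot as plt
from datetime import datetime

dates = ['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23',
         '2018-09-24', '2018-09-25', '2018-09-26', '2018-09-27']
large = [107.660514, 107.550403, 107.435041, 107.435003,
         107.574965, 107.449961, 107.650052, 107.649974]
small = [0.849215, 0.849655, 0.849655, 0.851095,
         0.850885, 0.850135, 0.851203, 0.851865]

# Convert the date strings to datetime objects so the x-axis is a real time axis
x = [datetime.strptime(d, "%Y-%m-%d") for d in dates]

# One panel per series: each gets its own y-range, the dates are shared
fig, (ax_large, ax_small) = plt.subplots(nrows=2, sharex=True, figsize=(8, 6))
ax_large.plot(x, large, color="r", linestyle="--", marker="o")
ax_small.plot(x, small, color="b", linestyle="--", marker="o")
ax_large.grid(True)
ax_small.grid(True)
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

Each axes object scales its y-axis to its own data, so the small changes around 0.85 remain visible next to the series around 107.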
I'm using matplotlib in python to create heatmaps for different clusters I've created using k-means clustering. Right now I'm able to produce this figure:
But I want the number of rows in each cluster reflected in the size of the heatmap, instead of them all being scaled to the same size. Is GridSpec the right way to do this? It's the only thing I can find trying to Google the solution, but it seems more suited to situations where you have subplots on a grid and you want a certain subplot to span more than one row or column on the grid. In this situation, I would be creating a grid with thousands of rows and telling each subplot to span hundreds of them. Is this still the best way to do it?
Edit: In case my question isn't clear, I'm ultimately trying to create a figure like this one. Notice how it's easy to see in the left figure that cluster E is larger than cluster F:
GridSpec has an argument height_ratios. You can set it to a list containing the number of rows of each heatmap.
import numpy as np
import matplotlib.pyplot as plt
data = [np.random.rand(n,8) for n in [3,7,10,4]]
fig, axes = plt.subplots(nrows=len(data),
                         gridspec_kw=dict(height_ratios=[d.shape[0] for d in data]))
for ax, d in zip(axes, data):
    ax.imshow(d)
    ax.tick_params(labelbottom=False)
plt.show()
I have some data, based on which I am trying to build a countplot in seaborn. So I do something like this:
data = np.hstack((np.random.normal(10, 5, 10000), np.random.normal(30, 8, 10000))).astype(int)
plot_ = sns.countplot(x=data)
and get my countplot:
The problem is that ticks on the x-axis are too dense (which makes them useless). I tried to decrease the density with plot_.xticks=np.arange(0, 40, 10) but it didn't help.
Also is there a way to make the plot in one color?
Tick frequency
There seem to be multiple issues here:
1. You are using the = operator while using plt.xticks. You should use a function call instead (but not here; read point 2 first)!
2. seaborn's countplot returns an axes object, not a figure.
3. You need to use the axes-level approach of changing x-ticks (which is not plt.xticks()).
Try this:
for ind, label in enumerate(plot_.get_xticklabels()):
    if ind % 10 == 0:  # every 10th label is kept
        label.set_visible(True)
    else:
        label.set_visible(False)
Colors
I think the data setup is not optimal here for this type of plot. Seaborn will interpret each unique value as a new category and introduce a new color. If I'm right, the number of colors and x-ticks equals the number of values in np.unique(data).
Compare your data to seaborn's examples (which are all based on data that can be imported to check).
I also think working with seaborn is much easier using pandas dataframes (and not numpy arrays; I often prepare my data in the wrong way and subset selection needs preprocessing; dataframes offer more). I think most of seaborn's examples use this kind of data input.
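For the single-color part, seaborn's countplot accepts a `color` argument. As an alternative that sidesteps the categorical axis entirely, the counts can be computed with `np.unique` and drawn with plain matplotlib. This is a sketch of that swapped-in approach, not of seaborn's own behavior; the color name and the headless `Agg` backend are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(123)
data = np.hstack((rng.normal(10, 5, 10000), rng.normal(30, 8, 10000))).astype(int)

# Count how often each integer value occurs
values, counts = np.unique(data, return_counts=True)

fig, ax = plt.subplots()
ax.bar(values, counts, color="steelblue")  # every bar gets the same color

# The x-axis is numeric here, so ticks can be placed at multiples of 10 directly
start = (values.min() // 10) * 10
ax.set_xticks(np.arange(start, values.max() + 1, 10))
```

Because the x-axis stays numeric, the tick density is controlled with ordinary tick positions instead of hiding individual labels.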
Even though this was answered a while ago, I'm adding another, perhaps simpler, alternative that is more flexible.
You can use a matplotlib axis tick locator to control which ticks will be shown.
In this example you can use LinearLocator to achieve the same thing:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.ticker as ticker
data = np.hstack((np.random.normal(10, 5, 10000), np.random.normal(30, 8, 10000))).astype(int)
plot_ = sns.countplot(x=data)
plot_.xaxis.set_major_locator(ticker.LinearLocator(10))
Since you have tagged matplotlib, one solution other than toggling tick visibility is to plot only every nth label, as follows
fig = plt.figure(); np.random.seed(123)
data = np.hstack((np.random.normal(10, 5, 10000), np.random.normal(30, 8, 10000))).astype(int)
plot_ = sns.countplot(x=data)
fig.canvas.draw()
new_ticks = [i.get_text() for i in plot_.get_xticklabels()]
plt.xticks(range(0, len(new_ticks), 10), new_ticks[::10])
As a slight modification of the accepted answer, we typically select labels based on their value (and not index), e.g. to display only values which are divisible by 10, this would work:
for label in plot_.get_xticklabels():
    # np.int was removed from NumPy; the built-in int does the same job here
    if int(label.get_text()) % 10 == 0:
        label.set_visible(True)
    else:
        label.set_visible(False)
I have a 100.000.000 sample dataset and I want to make a histogram with pyplot. But reading this large file drains my memory critically (cursor not moving anymore, ...), so I'm looking for ways to 'help' pyplot.hist. I was thinking breaking up the file into several smaller files might help. But I wouldn't know how to combine them afterwards.
You can combine the output of pyplot.hist, or as @titusjan suggested numpy.histogram, as long as you keep your bins fixed each time you call it. For example:
import matplotlib.pyplot as plt
import numpy as np
# Generate some fake data
data=np.random.rand(1000)
# The fixed bins (change depending on your data)
bins=np.arange(0,1.1,0.1)
sub_hist = []
# Split into 100 sub histograms of 10 samples each
for i in np.arange(0,1000,10):
    sub_hist_temp, bins_out = np.histogram(data[i:i+10],bins=bins)
    sub_hist.append(sub_hist_temp)
# Sum the histograms
hist_sum = np.array(sub_hist).sum(axis=0)
# Plot the new summed data, using plt.bar
fig=plt.figure()
ax1=fig.add_subplot(211)
# align='edge' starts each bar at its left bin edge, matching ax.hist below
ax1.bar(bins[:-1],hist_sum,width=0.1,align='edge') # Change width depending on your bins
# Plot the histogram of all data to check
ax2=fig.add_subplot(212)
hist_all, bins_out, patches = ax2.hist(data,bins=bins)
fig.savefig('histsplit.png')
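To keep memory use low, the same idea can be written so that only one chunk is held at a time and the counts are added to a running total. The sketch below assumes the chunks arrive as an iterable of 1-D arrays; `chunked_histogram` is a hypothetical helper name, and the generator only simulates reading a file piece by piece (in practice the chunks might come from something like `pandas.read_csv` with its `chunksize` argument).

```python
import numpy as np

def chunked_histogram(chunks, bins):
    """Accumulate histogram counts over an iterable of 1-D arrays."""
    total = np.zeros(len(bins) - 1, dtype=np.int64)
    for chunk in chunks:
        counts, _ = np.histogram(chunk, bins=bins)  # same fixed bins every time
        total += counts
    return total

# Fake data standing in for the large file
rng = np.random.default_rng(0)
data = rng.random(1000)
bins = np.linspace(0.0, 1.0, 11)

# Simulate reading the data in pieces of 100 samples
chunks = (data[i:i + 100] for i in range(0, data.size, 100))
hist_sum = chunked_histogram(chunks, bins)
```

The resulting `hist_sum` can be drawn with `plt.bar` exactly as in the example above, without ever building a list of all sub histograms.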