Extra set of bars on plot in Pandas? - python

I want to create a plot using Pandas to show the standard deviations of item prices on specific week days (in my case there are 6 relevant days of the week, each shown as 0-5 on the x axis).
It seems to work; however, next to each standard deviation bar there is a second, smaller bar whose value is literally 0-5.
I think this means that I'm accidentally also plotting the day of the week.
How can I get rid of these smaller bars and only show the standard deviation bars?
sales_std = sales_std[['WeekDay', 'price']].groupby(['WeekDay']).std().reset_index()
Here is where I try to plot the graph:
p = sales_std.plot(figsize=(15, 5), legend=False, kind="bar", rot=45,
                   color="orange", fontsize=16, yerr=sales_std);
p.set_title("Standard Deviation", fontsize=18);
p.set_xlabel("WeekDay", fontsize=18);
p.set_ylabel("Price", fontsize=18);
p.set_ylim(0,100);
Resulting Bar Plot:

You are plotting both WeekDay and price at the same time (i.e. plotting the entire DataFrame). To show bars for price only, plot the price Series with WeekDay as its index, so no reset_index() is needed after groupby().
# you don't need `reset_index()` in your code
sales_std=sales_std[['WeekDay','price']].groupby(['WeekDay']).std()
sales_std['price'].plot(kind='bar')
Note: I intentionally omitted graph-styling parts of your code to focus on fixing the issue.
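If you want to keep the styling and error bars from the question while plotting only the price column, something like the sketch below should work. The sample data is invented purely for illustration; only the column names WeekDay and price come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd

# Made-up sample data (assumption): six week days, three prices each
sales = pd.DataFrame({
    "WeekDay": [0, 1, 2, 3, 4, 5] * 3,
    "price": [10, 20, 30, 40, 50, 60,
              14, 26, 33, 48, 52, 71,
              9, 22, 38, 41, 59, 65],
})

# Keep WeekDay as the index; selecting 'price' yields a Series
price_std = sales.groupby("WeekDay")["price"].std()

# Plotting the Series draws exactly one bar per week day
ax = price_std.plot(kind="bar", figsize=(15, 5), rot=45,
                    color="orange", fontsize=16, yerr=price_std)
ax.set_title("Standard Deviation", fontsize=18)
ax.set_xlabel("WeekDay", fontsize=18)
ax.set_ylabel("Price", fontsize=18)
```

Because WeekDay stays in the index, it is used for the x-axis labels and never drawn as a second set of bars.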

Related

Python stacked barchart where y-axis scale is linear but the bar fill is logarithmic in the order of 10s

As the title explains, I am trying to reproduce a stacked barchart where the y-axis scale is linear but the inside fill of the plot (i.e. the stacked bars) are logarithmic and grouped in the order of 10s.
I have made this plot before on R-Studio with an in-house package, however I am trying to reproduce the plot with other programs (python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with to give you an idea of my data
I am trying to reproduce this following plot I made with R-Studio:
this one here
This plot has the dataset divided in groups based on their frequency, with the top 10 most frequent grouped in red, followed by the next top 100, next 1000, etc etc. The y-axis has a 0.00-1.00 scale but also a 100% scale wouldn't change, they mean the same thing in this context.
This is just to get an idea and visualize if I have big clones (the top 10) and how much of the overall dataset they occupy in frequency - i.e. the bigger the red stack the larger clones I have, signifying there has been a significant clonal expansion in my sample of a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
MYDATAFRAME.groupby(['Sample','cloneFraction']).size().groupby(level=0).apply(lambda x: 100 * x / x.sum()).unstack().plot(kind='bar',stacked=True, legend=None)
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can just fix by sorting my dataframe by the column of interest).
Other than the axis messing up and not giving me a % when I use log scale (which is a secondary issue), I don't know how to group the data entries by frequency as I described above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
Just to see if I could separate them in a correct way but this does not achieve what I would like (other than being a pie chart, but I changed that in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't seem to get that working either, as I get errors when I try adding it to the code I've used above to plot my graph - and if I use it without the other fancy stuff in the code (% scale, etc) it gets confused in the stacked barplot and just plots the top 10 out of all the 4 samples in my data, rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, etc.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column with the category my samples fall in, based on their value (i.e. whether they're in the top 10 most frequent, the next 100, etc.).
df['category']='10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index,'category']='1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index,'category']='101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index,'category']='11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index,'category']='top10'
This code starts from the biggest group (10001+) and works down to the smallest, so each narrower group overwrites the label of the rows it shares with the broader group before it.
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Sample','category']).sum()['cloneFraction'].unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!
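The same binning can probably also be done without a loop, by ranking clones within each sample and cutting the ranks into bins. This is an alternative sketch, not the code used above: the toy data and the helper name rank_category are made up for illustration, and ties are broken by order of appearance, which may not match nlargest exactly in every case.

```python
import numpy as np
import pandas as pd

def rank_category(df, group_col="Sample", value_col="cloneCount"):
    """Label each row by its within-group rank of value_col (1 = largest)."""
    ranks = df.groupby(group_col)[value_col].rank(method="first", ascending=False)
    bins = [0, 10, 100, 1000, 10000, np.inf]
    labels = ["top10", "11-100", "101-1000", "1001-10000", "10001+"]
    # Intervals are right-closed: rank 10 is 'top10', rank 11 is '11-100'
    return pd.cut(ranks, bins=bins, labels=labels)

# Toy data (assumption): two samples with 15 clones each
toy = pd.DataFrame({
    "Sample": ["A"] * 15 + ["B"] * 15,
    "cloneCount": list(range(15, 0, -1)) + list(range(30, 15, -1)),
})
toy["category"] = rank_category(toy)
```

The resulting category column can then be fed into the same groupby/unstack/plot call shown above.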

How can I make a line chart with uneven time intervals in FacetGrid?

I made a line chart with seaborn FacetGrid. The x-axis is the column called "dato", which holds the date each sample was taken.
The time intervals between samples are unequal, so I want the spacing between ticks on the x-axis to reflect that. How can I do that?
I also have a problem getting the dates in the correct order. I tried some code, but it does not work with FacetGrid.
Here is my code and graph I made
mer=sns.FacetGrid(df, col='anlegg',hue='Merd',sharey=False,ylim=(0,4000))
sns.set_style('white')
sns.set_context('paper', font_scale=1.2)
mer.map_dataframe(sns.lineplot,x='dato',y='BW',marker='o',err_style='bars')
mer.add_legend()
mer.set_axis_labels('','BW')
mer.set_titles(col_template='anlegg {col_name}')
mer.set_xticklabels(rotation=90)
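One thing worth checking is whether "dato" is stored as strings: string values are treated as categories, which are spaced evenly and kept in order of appearance. A minimal sketch of converting the column before building the grid follows. The toy data and the day-first date format are assumptions, since the question does not show the real values.

```python
import pandas as pd

# Toy data (assumption): dates stored as day-first strings, unevenly spaced
df = pd.DataFrame({
    "dato": ["15.03.2021", "01.02.2021", "20.06.2021", "08.02.2021"],
    "BW": [900, 450, 2100, 520],
})

# Parse to real datetimes; the format string is an assumption about the data
df["dato"] = pd.to_datetime(df["dato"], format="%d.%m.%Y")
df = df.sort_values("dato").reset_index(drop=True)

# Gaps between consecutive samples are now real time spans, not equal steps
gaps = df["dato"].diff().dt.days
```

With a datetime column, passing df to the FacetGrid code above should place the points at their real positions in time and in chronological order.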

Plotting multilayered grouped df with xlim

I have a plot with hourly values for 2019. When plotting with a sub-set of dates (January only) on the x-axis, my plot goes blank.
I have a DF that I group on the row-axis based on Months and Hours from the time index, for a specific column 'SE3'. The grouping looks good.
Now, I want to plot. The plot looks potentially good, but I want to zoom in on one month only. Based on another post on stackoverflow, I use set_xlim.
Then my plot does not show anything.
#Grouping of DF
df['SE3'].groupby([df.index.month, df.index.hour]).mean().round(2).head()
Picture of grouped DF1
#Plotting and setting new, shorter in time x-axis
ax=df['SE3'].groupby([df.index.month, df.index.hour]).mean().round(2).plot()
ax.set_xlim(pd.Timestamp('2019-01-01 01:00:00'), pd.Timestamp('2019-01-31 23:00:00'))
The expected result is the same plot, but showing only January. Instead the graph goes blank. However, the Out data shows (737060.0416666666, 737090.9583333334), which seems to be date data.
Picture without set_xlim
Picture with set_xlim (empty)
My final aim, once I understand why my plot is blank, is to show hourly averages for each month.
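A likely explanation: after grouping by month and hour, the Series no longer has a date index. Its index is a (month, hour) MultiIndex, which pandas plots against positions 0, 1, 2, ... Setting the x-limits to timestamps converts them to date numbers around 737060, far outside that range, so nothing is visible. A sketch of selecting January by label instead follows. The data is synthetic and the level names month and hour are added only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for 2019 (assumption): stands in for the real 'SE3' column
idx = pd.Timestamp("2019-01-01") + pd.to_timedelta(np.arange(8760), unit="h")
df = pd.DataFrame({"SE3": np.arange(len(idx), dtype=float)}, index=idx)

grouped = df["SE3"].groupby([df.index.month, df.index.hour]).mean().round(2)
grouped.index.names = ["month", "hour"]  # names added for readability

# The grouped index is (month, hour), not dates, so select January by label
january = grouped.loc[1]  # 24 hourly averages for month 1

# One column per month gives one line per month when plotted
by_month = grouped.unstack(level="month")  # rows: hour 0-23, columns: month 1-12
```

Calling january.plot() would then show only January, and by_month.plot() would draw the hourly averages with one line per month.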

Plotting a column with millions of rows

I have a data-frame with millions of rows (almost 8 million). I need to see the distribution of the values in one of the columns. This column is called 'price_per_mile'. I also have a column called 'Borough'. The final goal is doing a t-test.
First I want to see the distribution of data in 'price_per_mile', to see if data is normal and if I need to do some data cleaning. Then group-by based on five categories in 'borough' column and then do the t-test for each possible pair of boroughs.
I have tried to plot the distribution with sns.distplot() but it doesn't give me a clear plot as it seems there's a scaling of the values on the y-axis. Also, the range of values contained in 'price_per_mile' is big.
Then I tried to plot a section of values, again the plot doesn't look clear and informative enough. Scaling happens again.
result.drop(result[(result.price_per_mile <1) | (result.price_per_mile>200)].index, inplace=True)
What do I need to do to have a better-looking plot which gives me the true value of each bin and not just a normalized value?
I read the documentation for sns.distplot() but didn't find something helpful.
As per the documentation for distplot (emphasis mine)
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
Which means that if you want the non-normalized histogram, you have to instruct seaborn not to plot the KDE at the same time:
sns.distplot(a, kde=True, norm_hist=False)   # still a density, because the KDE is drawn
sns.distplot(a, kde=False, norm_hist=False)  # raw counts per bin
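To see the difference between counts and a density without plotting anything, here is a small sketch using NumPy's histogram function rather than seaborn. The array a is random made-up data standing in for price_per_mile.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.exponential(scale=20.0, size=10_000)  # made-up stand-in for price_per_mile

# Raw counts: bar heights are the number of values in each bin
counts, edges = np.histogram(a, bins=50)

# Density: heights are rescaled so that the bar areas sum to 1
density, _ = np.histogram(a, bins=50, density=True)

widths = np.diff(edges)
total_area = (density * widths).sum()
```

The counts add up to the number of rows, while the density heights only add up to 1 after multiplying by the bin widths, which is why the y-axis looks "scaled" in the normalized plot.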

Xaxis Labels not matching Data Points - Pandas/Matplotlib

I have a TimeSeries in Pandas that I want to plot. I have 336 records in the TimeSeries. I only want to show the date/time (index of the TimeSeries) on the x-axis once per every 20 or so data points.
Here is how I am trying to do this:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, NullLocator

ax = stats.plot()  # keep the Axes returned by plot()
ax.set_xticklabels(stats.index, rotation=45)
ax.xaxis.set_major_locator(MultipleLocator(20))
ax.xaxis.set_minor_locator(NullLocator())
ax.yaxis.set_major_locator(MultipleLocator(.075))
plt.draw()
My x-axis shows the correct number of labels (18), but these are the first 18 in the series; they do not correspond to the data points in the plot.
The problem is you are using set_xticklabels which sets the value of the tick labels independent of the data. The ticks are labeled sequentially from the list you pass in.
From this I can't really tell what you are trying to do, but the behavior you are seeing is the 'correct' behavior for the library (it's doing exactly what you told it to, but that isn't what you want it to do).
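If the goal is simply one label per 20 data points, one option is to set the tick positions and the labels together, so that each label is taken from the same position as its tick. This is only a sketch under assumptions: the series is made up, and it is plotted against integer positions rather than through the pandas time-series plotting path.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up series (assumption): 336 hourly records, as in the question
index = pd.Timestamp("2013-01-01") + pd.to_timedelta(np.arange(336), unit="h")
stats = pd.Series(np.sin(np.arange(336) / 10.0), index=index)

fig, ax = plt.subplots()
positions = np.arange(len(stats))
ax.plot(positions, stats.values)

# Pick every 20th position and label it with the matching index entry
tick_positions = positions[::20]
tick_labels = [ts.strftime("%Y-%m-%d %H:%M") for ts in stats.index[::20]]
ax.set_xticks(tick_positions)
ax.set_xticklabels(tick_labels, rotation=45)
```

Because the labels are sliced with the same step as the positions, label k always describes data point 20*k.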
