I'm new to using Seaborn and usually only use Matplotlib.pyplot.
With the recent COVID developments I was asked by a supervisor to put together estimates of how changes to the student population & expenses we need to fund affected student fees (I work in a college budgeting office). I've been able to put together my scenario analysis, but am now trying to visualize these results in a heatmap.
What I'd like to be able to do is have the:
x-axis be my population change rates,
y-axis be my expense change rates,
cmap be my new student fees depending on the x & y axis.
What my code is currently doing is:
x-axis is displaying the new student fee category (not sure how to describe this - see picture)
y-axis is displaying the population change and expense change (population, expenses)
cmap is displaying accurately
Essentially, my code is stacking each scenario on top of the others along the y-axis.
Here is a picture of what is currently being produced, which is not correct:
I've attached a link to a Colab Jupyter notebook with my code, and below is a snippet of the section giving me problems.
# Create Pandas DF of Scenario Analysis
df = pd.DataFrame(list(zip(Pop, Exp, NewStud, NewTotal)),
index = [i for i in range(0,len(NewStud))],
columns=['Population_Change', 'Expense_Change', 'New_Student_Activity_Fee', 'New_Total_Fee'])
# Group this scenario analysis
df = df.groupby(['Population_Change', 'Expense_Change'], sort=False).max()
# Create Figure
fig = plt.figure(figsize=(15,8))
ax = plt.subplot(111)
# Drop New Student Activity Fee Column. Analyze Only New Total Fee
df = df.drop(['New_Student_Activity_Fee'], axis=1)
########################### Not Working As Desired
sb.heatmap(df)
###########################
Your DataFrame is not in the right shape for seaborn.heatmap(). For example, as a result of the groupby operation, you have Population_Change and Expense_Change as a MultiIndex, which would only be used for labelling by the plotting function.
So instead of the groupby, first drop the superfluous column, and then do this:
df = df.pivot(index='Expense_Change', columns='Population_Change', values='New_Total_Fee')
Then seaborn.heatmap(df) should work as expected.
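To illustrate the reshaping step, here is a minimal sketch on a small made-up scenario table. The numbers are invented for illustration; only the column names come from the question. It shows the wide grid that pivot produces, which is the shape a heatmap function expects.

```python
import pandas as pd

# Hypothetical scenario table: 3 population rates x 2 expense rates.
# The fee values are invented purely for illustration.
scenarios = pd.DataFrame({
    "Population_Change": [-0.10, -0.10, 0.00, 0.00, 0.10, 0.10],
    "Expense_Change": [0.00, 0.05, 0.00, 0.05, 0.00, 0.05],
    "New_Total_Fee": [110.0, 116.0, 100.0, 105.0, 91.0, 95.0],
})

# One row per expense rate, one column per population rate,
# each cell holding the fee for that combination.
grid = scenarios.pivot(index="Expense_Change",
                       columns="Population_Change",
                       values="New_Total_Fee")
print(grid)
```

If the same (population, expense) pair can occur more than once, pivot raises a ValueError. In that case pivot_table with aggfunc='max' would be one way to keep the maximum, mirroring the groupby(...).max() in the question.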
I’ve been trying for a while to set custom tooltips on a 3d surface plot, but cannot figure it out. I can do something very simple, like make the tooltip the same for each point, but I’m having trouble putting different values for each point in the tooltip, when the fields aren't being graphed.
In my example, I have a dataset of 53 rows (weeks) and 7 columns (days of the week) that I’m graphing on a 3d surface plot, by passing the dataframe in the Z parameter. It’s a year’s worth of data, so each day has its own numeric value that’s being graphed. I’m trying to label each point with the actual date (hence the custom tooltip, since I'm not passing the date itself to the graph), but cannot seem to align the tooltip values correctly.
I tried a simple example to create a "tooltip array" of the same shape as the dataframe, but when I test whether I’m getting the shape right, by using a repeated word, I get an even weirder error where it uses the character values in the word as tooltips (e.g., c or _). Does anyone have any thoughts or suggestions? I can post more code, but tried to replicate my error with a simpler example.
labels=np.array([['test_label']*7]*53)
fig = go.Figure(data=[
go.Surface(z=Z, text=labels, hoverinfo='text'
)],)
fig.show()
I have created sample data similar to the data shown in the image: a data frame with randomly generated values for each date of one consecutive year, with the week number and day number added, pivoted into the Z data. I have also added a date-only column, pivoted the same way, so that the hover text displays the date.
import numpy as np
import plotly.graph_objects as go
import pandas as pd
df = pd.DataFrame({'date':pd.to_datetime(pd.date_range('2021-01-01','2021-12-31',freq='1d')),'value':np.random.rand(365)})
df['day_of_week'] = df['date'].dt.weekday
df['week'] = df['date'].dt.isocalendar().week
df['date2'] = df['date'].dt.date
Z = df[['week','day_of_week','value']].pivot(index='week', columns='day_of_week')
labels = df[['week','day_of_week','date2']].pivot(index='week', columns='day_of_week').fillna('')
fig = go.Figure(data=[
go.Surface(z=Z,
text=labels,
hoverinfo='text'
)]
)
fig.update_layout(autosize=False, width=800, height=600)
fig.show()
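As a quick check that the labels line up with the surface, the sketch below rebuilds both grids with pandas only and compares their shapes. The "label" column and the string date format are assumptions made for this example. Note that 1 to 3 January 2021 belong to ISO week 53, so they land in the last row of the grid rather than the first.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2021-01-01", "2021-12-31", freq="D")
df = pd.DataFrame({"date": dates, "value": np.random.rand(len(dates))})
df["week"] = df["date"].dt.isocalendar()["week"].astype(int)
df["day_of_week"] = df["date"].dt.weekday
# Assumed label format: one ISO date string per day.
df["label"] = df["date"].dt.strftime("%Y-%m-%d")

# Pivot values and labels the same way so cell (i, j) refers to the same day.
Z = df.pivot(index="week", columns="day_of_week", values="value")
labels = df.pivot(index="week", columns="day_of_week", values="label").fillna("")

z_array = Z.to_numpy()
label_array = labels.to_numpy()
print(z_array.shape, label_array.shape)
```

Because both arrays come from the same pivot, they share one shape and could be passed as z and text respectively.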
As the title explains, I am trying to reproduce a stacked barchart where the y-axis scale is linear but the inside fill of the plot (i.e. the stacked bars) are logarithmic and grouped in the order of 10s.
I have made this plot before on R-Studio with an in-house package, however I am trying to reproduce the plot with other programs (python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with to give you an idea of my data
I am trying to reproduce the following plot, which I made with R-Studio:
This plot divides the dataset into groups based on frequency, with the top 10 most frequent clones grouped in red, followed by the next 100, the next 1000, and so on. The y-axis has a 0.00-1.00 scale; a 0-100% scale would mean the same thing in this context.
This is just to get an idea and visualize if I have big clones (the top 10) and how much of the overall dataset they occupy in frequency - i.e. the bigger the red stack the larger clones I have, signifying there has been a significant clonal expansion in my sample of a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
MYDATAFRAME.groupby(['Sample','cloneFraction']).size().groupby(level=0).apply(lambda x: 100 * x / x.sum()).unstack().plot(kind='bar',stacked=True, legend=None)
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can just fix by sorting my dataframe by the column of interest).
Other than the axis messing up and not giving me a % when I use log scale (which is a secondary issue), I don't know how to group the data entries by frequency as I mentioned above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
Just to see if I could separate them in a correct way but this does not achieve what I would like (other than being a pie chart, but I changed that in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't seem to get that working either, as I get errors when I try adding it to the code I've used above to plot my graph - and if I use it without the other fancy stuff in the code (% scale, etc) it gets confused in the stacked barplot and just plots the top 10 out of all the 4 samples in my data, rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, etc.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column with the category my samples fall in, based on their value (i.e. if they're the top 10 most frequent, the next 100, and so on).
df['category']='10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index,'category']='1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index,'category']='101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index,'category']='11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index,'category']='top10'
This code starts from the biggest group (10001+) and goes smaller and smaller, to include overlapping samples that might fall into the next big group.
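An alternative that avoids the loop is to rank the clones within each sample and cut the ranks into bins. This is only a sketch on invented data, not the code I actually ran: the column name "Sample" as the grouping key and the toy counts are assumptions, and ties are broken by order of appearance.

```python
import pandas as pd

# Toy data: sample A has 15 clones, sample B has 12 (counts are invented).
df = pd.DataFrame({
    "Sample": ["A"] * 15 + ["B"] * 12,
    "cloneCount": list(range(15, 0, -1)) + list(range(12, 0, -1)),
})

# Rank 1 is the largest clone within each sample.
rank = df.groupby("Sample")["cloneCount"].rank(method="first", ascending=False)

# Bins are right-closed: ranks 1-10, 11-100, 101-1000, 1001-10000, above 10000.
df["category"] = pd.cut(
    rank,
    bins=[0, 10, 100, 1000, 10000, float("inf")],
    labels=["top10", "11-100", "101-1000", "1001-10000", "10001+"],
)

print(pd.crosstab(df["Sample"], df["category"]))
```

The resulting category column can be used in the same groupby and stacked bar plot as the loop version.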
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Sample','category']).sum()['cloneFraction'].unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!
So I am plotting to visualize the education difference between genders in a given dataset.
I group the employees by gender, summing their years_in_education with this code
df1 = df[["gender","years_in_education"]] #creating a sub-dataframe with only the columns of gender and years in education
staff4=df1.groupby(['gender']).sum() #grouping the data frame by gender and assigning it to a new variable 'staff4'
staff4.head() #creating a visual of the grouped data for inspection
Then I use a bar chart to plot the difference with this code >>
my_plot = staff4.T.plot(kind='bar',title="Education difference between Genders") #creating the parameters to plot the graph
The graph comes out as this >>
But I observe that the scale of the y-axis is outrageous, as the highest year in employment in the data is 30. I intend to adjust the scale to range from 0 - 30. I did that using my_plot.set_ylim([0,30]) and the result of that was >>
This graph is not reflective of the data as shown in the first. What can I do to change that?
Any ideas, please? Also, how can I change the orientation of the label on the y-axis?
I have a plot with hourly values for 2019. When plotting with a sub-set of dates (January only) on the x-axis, my plot goes blank.
I have a DF that I group on the row-axis based on Months and Hours from the time index, for a specific column 'SE3'. The grouping looks good.
Now, I want to plot. The plot looks potentially good, but I want to zoom in on one month only. Based on another post on stackoverflow, I use set_xlim.
Then my plot does not show anything.
#Grouping of DF
df['SE3'].groupby([df.index.month, df.index.hour]).mean().round(2).head()
Picture of grouped DF1
#Plotting and setting new, shorter in time x-axis
ax=df['SE3'].groupby([df.index.month, df.index.hour]).mean().round(2).plot()
ax.set_xlim(pd.Timestamp('2019-01-01 01:00:00'), pd.Timestamp('2019-01-31 23:00:00'))
The expected result is the same plot, but now only for January. Instead the graph goes blank. However, the Out data shows
(737060.0416666666, 737090.9583333334), which seems to be date data.
Picture without set_xlim
Picture with set_xlim (empty)
My final aim, once I understand why my plot is blank, is to show hourly averages for each month.
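One likely explanation, sketched here on random stand-in data (the values and the variable names are assumptions, only the column name 'SE3' comes from the post): after grouping by month and hour the result is indexed by (month, hour) pairs, not by timestamps, so it is plotted against positions 0 to 287. Limits given as Timestamps are converted to date numbers such as 737060, far outside that range, which would leave the axes empty. Selecting the month from the grouped result avoids the need for set_xlim.

```python
import numpy as np
import pandas as pd

# Stand-in for the real data: one random value per hour of 2019.
idx = pd.date_range("2019-01-01 00:00", "2019-12-31 23:00", freq="h")
df = pd.DataFrame({"SE3": np.random.rand(len(idx))}, index=idx)

grouped = df["SE3"].groupby([df.index.month, df.index.hour]).mean().round(2)

# The index is now (month, hour) pairs: 12 * 24 entries, no dates left.
print(type(grouped.index).__name__, len(grouped))

# Hourly averages for January only, indexed by hour 0..23.
january = grouped.loc[1]

# Hours as rows and months as columns, e.g. for one line per month.
by_hour = grouped.unstack(level=0)
print(by_hour.shape)
```

Calling january.plot() or by_hour.plot() would then draw against the hour of day on the x-axis.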
I need to plot a series of boxplots, based on results of numerical air quality model. Since this is a significant amount of data, I trigger calculation of aggregates (min, max, quartiles, etc.) every time when new model results become ready and store them in PostgreSQL. For visualization purpose I load the aggregates into pandas and I plot them using dash. I am able to plot line plots of timeseries, however I would like to get something like this example, but also interactive.
Going through the plotly examples, it looks like box plots always require the raw data ( https://plot.ly/python/box-plots/#basic-box-plot ). I really like the separation of presentation and logic. Is it possible to get a plotly box plot based on aggregated data?
You can provide your aggregate values to a Plotly box plot in Python in the following format:
plotly.graph_objs.Box(y=[val_min,
val_lower_box,
val_lower_box,
val_median,
val_upper_box,
val_upper_box,
val_max])
e.g.
import plotly.graph_objs
import plotly.offline
plotly.offline.init_notebook_mode()
val_min = 1
val_lower_box = 2
val_median = 3
val_upper_box = 4.5
val_max = 6
box_plot = plotly.graph_objs.Box(y=[val_min,
val_lower_box,
val_lower_box,
val_median,
val_upper_box,
val_upper_box,
val_max])
plotly.offline.iplot([box_plot])
This gives you a box plot drawn from those seven values.
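To see why the box edges are listed twice, here is a small standard-library check. The helper name box_values_from_aggregates is made up for this sketch, and statistics.quantiles is used only as a stand-in quartile calculation, not as Plotly's own method. Repeating each box edge puts the quartile positions between two equal values, so the recomputed quartiles come out at the supplied aggregates.

```python
import statistics


def box_values_from_aggregates(val_min, q1, median, q3, val_max):
    """Hypothetical helper: build the 7-value list described above."""
    return [val_min, q1, q1, median, q3, q3, val_max]


y = box_values_from_aggregates(1, 2, 3, 4.5, 6)

# Recompute the quartiles from the synthetic points as a sanity check.
q1, q2, q3 = statistics.quantiles(y, n=4, method="inclusive")
print(y)
print(q1, q2, q3)
```

The list y is what would be passed as the y argument of the Box trace.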