I am wondering if there is a way to visualise three variables using a stacked displot. Two variables are relatively straightforward using the x and hue arguments. For example:
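import pandas as pd
import seaborn as sns

sns.set(font_scale=2)  # set the font scale before plotting so it applies to this figure
tt = pd.read_csv("titanic-data.csv")
tt['Pclass'] = tt['Pclass'].astype(str)  # treat passenger class as categorical
sns.displot(x="Pclass", hue="Embarked", multiple="stack", data=tt, height=5, aspect=1)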
Which results in the following:
Now I would like to add whether a passenger survived or not, for example by splitting the bar for every value of Pclass in two, as in the following sketch, where the bottom-left rectangle for every category could be class 0 (not survived) and the top-right one class 1 (survived).
Can anyone advise how to implement this, or suggest any other sensible way of visualising three variables?
Many thanks.
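One possible approach, offered as a sketch under assumptions rather than a definitive answer: cross-tabulate Pclass and Survived against Embarked with pandas and draw a stacked bar chart, so that every class gets one bar per survival outcome. The inline DataFrame below is made-up stand-in data for titanic-data.csv, and the name counts is only illustrative. seaborn's displot also accepts a col argument (for example col="Survived"), which would give one stacked panel per outcome instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for titanic-data.csv (assumption: same column names)
tt = pd.DataFrame({
    "Pclass":   ["1", "1", "2", "2", "3", "3", "3", "3"],
    "Survived": [1, 0, 1, 0, 0, 0, 1, 0],
    "Embarked": ["S", "C", "S", "S", "Q", "S", "S", "C"],
})

# Rows: (Pclass, Survived) pairs; columns: Embarked; values: passenger counts
counts = pd.crosstab([tt["Pclass"], tt["Survived"]], tt["Embarked"])

# One bar per (Pclass, Survived) pair, stacked by port of embarkation
ax = counts.plot(kind="bar", stacked=True, figsize=(8, 5))
ax.set_xlabel("Pclass, Survived")
ax.set_ylabel("Count")
plt.tight_layout()
```

Because the bars of one class sit next to each other, the survived and not-survived counts can be compared directly within each class.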
As the title explains, I am trying to reproduce a stacked bar chart where the y-axis scale is linear but the inside fill of the plot (i.e. the stacked bars) is logarithmic, grouped by powers of 10.
I have made this plot before in R-Studio with an in-house package; however, I am trying to reproduce it with another program (Python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with, to give you an idea of my data.
I am trying to reproduce the following plot, which I made with R-Studio:
This plot divides the dataset into groups based on frequency: the 10 most frequent clones are grouped in red, followed by the next 100, the next 1000, and so on. The y-axis uses a 0.00-1.00 scale, but a 0-100% scale would work equally well, since they mean the same thing in this context.
This is just to get an idea and visualize if I have big clones (the top 10) and how much of the overall dataset they occupy in frequency - i.e. the bigger the red stack the larger clones I have, signifying there has been a significant clonal expansion in my sample of a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
MYDATAFRAME.groupby(['Sample','cloneFraction']).size().groupby(level=0).apply(lambda x: 100 * x / x.sum()).unstack().plot(kind='bar',stacked=True, legend=None)
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot:
Now, I realize there is no order in the stacked plot, so the most frequent clones aren't on top; it's just stacking in the order of the entries in my dataset (which I assume I can fix by sorting my dataframe by the column of interest).
Apart from the axis not showing a % when I use the log scale (which is a secondary issue), I don't know how to group the data entries by frequency as described above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
Just to see if I could separate them in a correct way but this does not achieve what I would like (other than being a pie chart, but I changed that in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't get that working either: I get errors when I add it to the plotting code above. If I use it without the other parts of that code (% scale, etc.), the stacked bar plot shows the top 10 across all 4 samples in my data rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, etc.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I solved this as follows.
I created a new column with the category each clone falls into, based on its value (i.e. whether it is among the top 10 most frequent, the next 100, etc.).
df['category'] = '10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index, 'category'] = '1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index, 'category'] = '101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index, 'category'] = '11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index, 'category'] = 'top10'
This code starts with the broadest label (10001+) and then overwrites it with progressively narrower ones, so every clone ends up with the label of the smallest rank group it belongs to (the top 10 are also part of the top 100, and so on).
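The same binning can also be written without the loop. The following is only a sketch under assumptions, not the code used above: it ranks the clones within each sample and cuts the ranks into the same groups with pd.cut. The data is synthetic, the column names Sample and cloneCount are assumed, and rank, bins, labels and shares are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 2 samples with 150 clones each (assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sample": np.repeat(["A", "B"], 150),
    "cloneCount": rng.integers(1, 1000, size=300),
})
df["cloneFraction"] = df["cloneCount"] / df.groupby("Sample")["cloneCount"].transform("sum")

# Rank clones within each sample, 1 = most frequent; ties are broken by row order
rank = df.groupby("Sample")["cloneCount"].rank(method="first", ascending=False)

# Right-closed bins: (0, 10], (10, 100], (100, 1000], ...
bins = [0, 10, 100, 1000, 10000, np.inf]
labels = ["top10", "11-100", "101-1000", "1001-10000", "10001+"]
df["category"] = pd.cut(rank, bins=bins, labels=labels)

# Share of each sample taken up by every rank group (each row sums to 1)
shares = (
    df.groupby(["Sample", "category"], observed=True)["cloneFraction"]
    .sum()
    .unstack()
)
print(shares)
```

Since pd.cut returns an ordered categorical, the columns of shares come out in rank-group order, which also fixes the stacking order if the table is passed to plot(kind="bar", stacked=True).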
Following this, I plotted the samples with the following code:
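fig, ax = plt.subplots(figsize=(15, 7))
# Sum cloneFraction per sample and category, then pivot categories into columns for stacking
df.groupby(['Sample', 'category'])['cloneFraction'].sum().unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
# xmax=1 formats fractions in the range 0-1 as 0%-100%
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
# Reverse the legend entries so they match the stacking order from top to bottom
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype', bbox_to_anchor=(1.04, 0), loc="lower left", borderaxespad=0)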
And here are the results:
I hope this helps anyone struggling with the same issue!
I have a pandas dataframe and want to plot one value versus another, based on a particular field.
So I have 5 different types in 'Topic' and I want to plot each. Code as below at present.
dfCombinedToPlot.groupby('Topic').plot(x='DataValue', y='CasesPer100kPop', style='o')
# plt.title() Want this to equal "'Topic' vs number of cases"
# plt.xlabel() Want this to equal 'Topic'
# plt.ylabel()
plt.show()
I have 3 questions.
1. Can I add a title/xlabel to each of these which matches that from the Topic? So if the topic was "Asthma", I want the title/x label to be "Asthma", and then the next one to be "Bronchitis", and so on.
2. I want to add these to the same plot if possible; I will decide how many look good once I see them. How do I do this?
3. (Bonus question!!) Can I easily add a "best fit" line to each plot?
Thanks all.
The groupby.plot method returns one Axes per group (as a pandas Series indexed by the group key), which you can use to set the titles. For example:
axes = dfCombinedToPlot.groupby('Topic').plot(x='DataValue', y='CasesPer100kPop', style='o')
# axes is indexed by topic, so the labels can come straight from the group keys
for topic, ax in axes.items():
    ax.set_title(f'{topic} vs number of cases')
    ax.set_xlabel(topic)
plt.show()
For the best-fit line, I assume you mean a linear regression, so it might be worth checking seaborn's regplot.
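For the remaining two parts of the question (all topics on one plot, plus a fitted line), here is a minimal sketch that swaps regplot for a plain least-squares line from np.polyfit. The numbers are made up; only the column names come from the question, and fits is an illustrative name.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for dfCombinedToPlot (assumption: same column names)
df = pd.DataFrame({
    "Topic": ["Asthma"] * 4 + ["Bronchitis"] * 4,
    "DataValue": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "CasesPer100kPop": [2.1, 3.9, 6.2, 7.8, 1.0, 1.4, 2.1, 2.4],
})

fig, ax = plt.subplots()
fits = {}
for topic, grp in df.groupby("Topic"):
    # Degree-1 polynomial fit = straight best-fit line
    slope, intercept = np.polyfit(grp["DataValue"], grp["CasesPer100kPop"], deg=1)
    fits[topic] = (slope, intercept)
    points = ax.plot(grp["DataValue"], grp["CasesPer100kPop"], "o", label=topic)[0]
    xs = np.sort(grp["DataValue"].to_numpy())
    # Draw the fitted line in the same color as the points of this topic
    ax.plot(xs, slope * xs + intercept, color=points.get_color())

ax.set_xlabel("DataValue")
ax.set_ylabel("CasesPer100kPop")
ax.legend(title="Topic")
```

Drawing every group on one shared Axes keeps the topics comparable; to get one panel per topic instead, the same loop body can be applied to the Axes returned by plt.subplots(1, n).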
I am unsure how to customize scatterplot marker styles in Plotly scatterplots.
Specifically, I have a column predictions that is 0 or 1 (1 represents an unexpected value). I used the symbol parameter in px.scatter_3d to indicate the unexpected values through point shape (diamond for 1 and circle for 0), but the difference is very subtle and I want it to be more dramatic. I was envisioning something like the image below (it doesn't need to be exactly this), for example the diamond-shaped points having a different outline color or an additional shape/bubble around them. How would I do this?
Additionally, I have a set column which can take one of two values, set A or set B. I passed set to the color parameter of px.scatter_3d so the points are colored according to the set they came from. This works, but I don't want the colors to be blue and red; I want any two colors I specify. How would I do this (let's say I want the colors to be blue and orange instead)? Thank you so much!
Here is the code I used:
fig = px.scatter_3d(X_combined, x='x', y='y', z='z',
                    color='set', symbol='predictions', opacity=0.7)
fig.update_traces(marker=dict(size=12,
                              line=dict(width=5,
                                        color='Black')),
                  selector=dict(mode='markers'))
You can use multiple go.Scatter3d() statements and gather them in a list to format each and every segment or extreme values more or less exactly as you'd like. This can be a bit more demanding than using px.scatter_3d(), but it will give you more control. The following plot is produced by the snippet below:
Plot:
Code:
import plotly.graph_objects as go
import numpy as np
import pandas as pd
# sample data
t = np.linspace(0, 10, 50)
x, y, z = np.cos(t), np.sin(t), t
# plotly data
data = [go.Scatter3d(x=[x[2]], y=[y[2]], z=[z[2]], mode='markers', marker=dict(size=20), opacity=0.8),
        go.Scatter3d(x=[x[26]], y=[y[26]], z=[z[26]], mode='markers', marker=dict(size=30), opacity=0.3),
        go.Scatter3d(x=x, y=y, z=z, mode='markers')]
fig = go.Figure(data)
fig.show()
How you identify the different segments, whether by max or min values, is entirely up to you. Anyway, I hope this approach will be useful!
This question already has answers here:
Pandas DataFrame Bar Plot - Plot Bars Different Colors From Specific Colormap
(3 answers)
Closed 4 years ago.
I would like to change the default color scheme of my pandas plot. I tried different color schemes through the cmap parameter of the pandas plot, but when I change it, all bars of my bar plot get the same color.
The code I tried is the following one:
yearlySalesGenre = df1.groupby('Genre').Global_Sales.sum().sort_values()
fig = plt.figure()
ax2 = plt.subplot()
yearlySalesGenre.plot(kind='bar',ax=ax2, sort_columns=True, cmap='tab20')
plt.show()
And the data that I plot (yearlySalesGenre) is a pandas Series type:
Genre
Strategy 174.50
Adventure 237.69
Puzzle 243.02
Simulation 390.42
Fighting 447.48
Racing 728.90
Misc 803.18
Platform 828.08
Role-Playing 934.40
Shooter 1052.94
Sports 1249.47
Action 1745.27
Using the tab20 cmap I get the following plot:
All bars get the first color of the tab20 scheme. What am I doing wrong?
Note that if I use the default color scheme of pandas plot, it properly displays all bars with different colors, but the thing is that I want to use a particular color scheme.
As noted, this is a duplicate question. In short, pandas assigns colormap colors per column, not per row. So to get different colors you can either transpose the data plus a few extra steps (see the duplicate link), or call matplotlib.pyplot directly, which allows more flexibility (this is what I did in my case):
plt.bar(range(len(df)), df, color=plt.cm.tab20(np.arange(len(df))))
Maybe this is what you want:
df2.T.plot( kind='bar', sort_columns=True, cmap='tab20')
I think the problem is that you only have one series. A pandas bar plot gives each series (column) its own color and separates the bars along the index.
By using .T, the entries in your data become multiple columns under a single index value. Note that .T only has this effect on a DataFrame; a plain Series would first need to be converted, e.g. with to_frame(). I am sure you can play with the legend to get a better display.
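As a self-contained sketch of the per-bar coloring discussed in these answers, the snippet below rebuilds the Series from the question and passes one tab20 color per bar to matplotlib directly. The names colors and bars are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# The Series shown in the question, rebuilt by hand
yearlySalesGenre = pd.Series({
    "Strategy": 174.50, "Adventure": 237.69, "Puzzle": 243.02,
    "Simulation": 390.42, "Fighting": 447.48, "Racing": 728.90,
    "Misc": 803.18, "Platform": 828.08, "Role-Playing": 934.40,
    "Shooter": 1052.94, "Sports": 1249.47, "Action": 1745.27,
}, name="Global_Sales")

# Integer inputs index the colormap's color table, giving one RGBA row per bar
colors = plt.cm.tab20(np.arange(len(yearlySalesGenre)))

fig, ax = plt.subplots()
bars = ax.bar(list(yearlySalesGenre.index), yearlySalesGenre.to_numpy(), color=colors)
ax.tick_params(axis="x", rotation=90)
ax.set_ylabel("Global_Sales")
fig.tight_layout()
```

tab20 holds 20 colors, so with more than 20 bars the integer indices would need to wrap around, for example with np.arange(n) % 20.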
I'm currently trying to plot 7 days of data whose values range from small to large.
The first set of data may look like this:
dates = ['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23', '2018-09-24', '2018-09-25', '2018-09-26', '2018-09-27']
values = [107.660514, 107.550403, 107.435041, 107.435003, 107.574965, 107.449961, 107.650052, 107.649974]
whereas another set of data may have the same dates but values with much smaller incremental changes:
dates = ['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23', '2018-09-24', '2018-09-25', '2018-09-26', '2018-09-27']
values = [0.849215, 0.849655, 0.849655, 0.851095, 0.850885, 0.850135, 0.851203, 0.851865]
When I use this
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
plt.plot_date(x=dates, y=values, fmt="r--")
plt.ylabel(c)
plt.grid(True)
plt.savefig('static/%s.png' % c)
The resulting image for the 1st set of values comes out as a dashed line connecting the points for each day. But the 2nd set of data produces an image of 7 parallel lines stacked on top of each other.
Should I be plotting this differently?
I assume you would like to compare the two sets of data you provided.
However, with such a gap between the two sets of values, the result could be fairly unclear if you show both in the same plot.
You could use plt.subplots() to do that, and you'll probably get a plot like this
Or, better, just show the two plots separately, and you'll get a much clearer picture.
If you want to just show two plots, you can do something like this.
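A minimal sketch of that idea, using the two data sets from the question: each series gets its own panel and therefore its own y-axis range, while the date axis is shared. It uses ax.plot with parsed datetime objects rather than plt.plot_date, and the names values_a, values_b, ax_a and ax_b are illustrative.

```python
from datetime import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Parse the date strings so matplotlib treats the x-axis as real dates
dates = [datetime.strptime(d, "%Y-%m-%d") for d in [
    "2018-09-20", "2018-09-21", "2018-09-22", "2018-09-23",
    "2018-09-24", "2018-09-25", "2018-09-26", "2018-09-27",
]]
values_a = [107.660514, 107.550403, 107.435041, 107.435003,
            107.574965, 107.449961, 107.650052, 107.649974]
values_b = [0.849215, 0.849655, 0.849655, 0.851095,
            0.850885, 0.850135, 0.851203, 0.851865]

# Two stacked panels: shared dates, independent y-axis ranges
fig, (ax_a, ax_b) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
ax_a.plot(dates, values_a, "r--o")
ax_b.plot(dates, values_b, "b--o")
for ax in (ax_a, ax_b):
    ax.grid(True)
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
# fig.savefig("two_series.png")  # hypothetical output path
```

Because each panel scales to its own data, the small changes around 0.85 stay visible even though the other series sits near 107.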